Re: spark df schema to hive schema converter func

2016-08-06 Thread Sumit Khanna
wrt https://issues.apache.org/jira/browse/SPARK-5236. Also, how do I usually convert something of type DecimalType to int / string / etc.? Thanks, On Sun, Aug 7, 2016 at 10:33 AM, Sumit Khanna wrote: > Hi, > > was wondering if we have something like that takes as an

spark df schema to hive schema converter func

2016-08-06 Thread Sumit Khanna
Hi, was wondering if we have something that takes as an argument a Spark DF type, e.g. DecimalType(12,5), and converts it into the corresponding Hive schema type. Double / Decimal / String? Any ideas. Thanks,
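
[Editor's note] A minimal sketch of such a converter, assuming only the common primitive types need to be handled; the helper names toHiveType / toHiveColumns are made up for illustration and are not from the thread:

    import org.apache.spark.sql.types._

    // Map a Spark SQL DataType to a Hive DDL type string (common cases only).
    def toHiveType(dt: DataType): String = dt match {
      case ByteType       => "tinyint"
      case ShortType      => "smallint"
      case IntegerType    => "int"
      case LongType       => "bigint"
      case FloatType      => "float"
      case DoubleType     => "double"
      case d: DecimalType => s"decimal(${d.precision},${d.scale})"
      case StringType     => "string"
      case BooleanType    => "boolean"
      case TimestampType  => "timestamp"
      case DateType       => "date"
      case other          => other.simpleString   // fall back to Spark's own name
    }

    // Build a Hive column list from a DataFrame schema, e.g. for a CREATE TABLE statement.
    def toHiveColumns(schema: StructType): String =
      schema.fields.map(f => s"${f.name} ${toHiveType(f.dataType)}").mkString(", ")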

Re: Dropping late data in Structured Streaming

2016-08-06 Thread Matei Zaharia
Yes, a built-in mechanism is planned in future releases. You can also drop it using a filter for now but the stateful operators will still keep state for old windows. Matei > On Aug 6, 2016, at 9:40 AM, Amit Sela wrote: > > I've noticed that when using Structured
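
[Editor's note] A minimal sketch of the filter-based workaround mentioned above, assuming a streaming DataFrame with an event-time column named "eventTime" and a 10-minute lateness cut-off (both are placeholder choices, not from the thread):

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions._

    // Drop rows whose event time lags processing time by more than maxLatenessSec.
    // Note that stateful operators still keep state for old windows, as noted above.
    def dropLate(events: DataFrame, maxLatenessSec: Long = 600): DataFrame =
      events.filter(
        unix_timestamp(current_timestamp()) - unix_timestamp(col("eventTime")) <= maxLatenessSec)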

Re: Dataframe / Dataset partition size...

2016-08-06 Thread Muthu Jayakumar
Hello Dr Mich Talebzadeh, > Can you kindly advise on your number of nodes, the cores for each node and the RAM for each node. I have a 32-node (1 executor per node currently) cluster. All these have 512 GB of memory. Most of these have either 16 or 20 physical cores (without HT enabled). The HDFS

Re: Symbol HasInputCol is inaccessible from this place

2016-08-06 Thread Ted Yu
I searched *Suite.scala and found that only the following contains some classes extending Transformer: ./mllib/src/test/scala/org/apache/spark/ml/PipelineSuite.scala But HasInputCol is not used. FYI On Sat, Aug 6, 2016 at 11:01 AM, janardhan shetty wrote: > Yes seems

Spark Application Counters Using Rest API

2016-08-06 Thread Muhammad Haris
Hi, Could anybody please guide me on how to get application- or job-level counters for CPU and memory in Spark 2.0.0 using the REST API. I have explored the APIs at http://spark.apache.org/docs/latest/monitoring.html but did not find anything similar to what MR provides; see the link below: (
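
[Editor's note] The monitoring REST API has no MR-style counters, but per-executor resource figures are available. A minimal sketch below; the host, port and application id are placeholders for your own environment:

    import scala.io.Source

    // Pull the executor summaries (memory used, cores, task time, ...) as raw JSON
    // from the REST API embedded in the application UI.
    val appId = "app-20160806120000-0001"   // hypothetical application id
    val url   = s"http://localhost:4040/api/v1/applications/$appId/executors"
    val json  = Source.fromURL(url).mkString
    println(json)                           // parse with any JSON library as needed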

Dataframe / Dataset partition size...

2016-08-06 Thread Muthu Jayakumar
Hello there, I am trying to understand how I could improve (or increase) the parallelism of tasks that run for a particular spark job. Here is my observation... scala> spark.read.parquet("hdfs://somefile").toJavaRDD.partitions.size() 25 > hadoop fs -ls hdfs://somefile | grep 'part-r' | wc -l
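
[Editor's note] A minimal sketch of one way to raise the task count, assuming the 25 partitions simply mirror the 25 part files; the target of 200 is only an example value:

    // The initial partition count follows the file splits, so repartition explicitly afterwards.
    val df    = spark.read.parquet("hdfs://somefile")
    val wider = df.repartition(200)
    println(wider.rdd.getNumPartitions)   // 200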

Re: Explanation regarding Spark Streaming

2016-08-06 Thread Mich Talebzadeh
Thanks. This is very confusing, as the thread owner's question does not specify whether there are windowing operations or not. Dr Mich Talebzadeh LinkedIn https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

RE: Explanation regarding Spark Streaming

2016-08-06 Thread Mohammed Guller
According to the docs for Spark Streaming, the default for data received through receivers is MEMORY_AND_DISK_SER_2. If windowing operations are performed, RDDs are persisted with StorageLevel.MEMORY_ONLY_SER.
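
[Editor's note] A minimal sketch of where that receiver storage level is chosen: it can be passed explicitly when the input stream is created (host and port are placeholders); omitting the argument leaves the receiver default in place:

    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(sc, Seconds(10))     // sc: an existing SparkContext
    val lines = ssc.socketTextStream("localhost", 9999,
      StorageLevel.MEMORY_AND_DISK_SER_2)               // explicit storage level for received blocks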

Re: Explanation regarding Spark Streaming

2016-08-06 Thread Mich Talebzadeh
Hi, I think the default storage level is MEMORY_ONLY. HTH Dr Mich Talebzadeh LinkedIn https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

Re: Symbol HasInputCol is inaccessible from this place

2016-08-06 Thread janardhan shetty
Yes, it seems so. Wondering if this can be made public in order to develop custom transformers, or are there any other alternatives? On Sat, Aug 6, 2016 at 10:07 AM, Ted Yu wrote: > Is it because HasInputCol is private ? > > private[ml] trait HasInputCol extends Params { > > On Thu,
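
[Editor's note] A minimal sketch of the usual workaround, assuming only the column parameter itself is needed: declare an equivalent Param on your own Transformer rather than mixing in the private[ml] HasInputCol. The class and column names are illustrative:

    import org.apache.spark.ml.Transformer
    import org.apache.spark.ml.param.{Param, ParamMap}
    import org.apache.spark.ml.util.Identifiable
    import org.apache.spark.sql.{DataFrame, Dataset}
    import org.apache.spark.sql.functions.upper
    import org.apache.spark.sql.types.{StringType, StructType}

    class UpperCaser(override val uid: String) extends Transformer {
      def this() = this(Identifiable.randomUID("upperCaser"))

      // Stand-in for HasInputCol: a plain Param plus a setter.
      val inputCol = new Param[String](this, "inputCol", "input column name")
      def setInputCol(value: String): this.type = set(inputCol, value)

      override def transform(ds: Dataset[_]): DataFrame =
        ds.withColumn($(inputCol) + "_upper", upper(ds($(inputCol))))

      override def transformSchema(schema: StructType): StructType =
        schema.add($(inputCol) + "_upper", StringType)

      override def copy(extra: ParamMap): UpperCaser = defaultCopy(extra)
    }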

Long running tasks in stages

2016-08-06 Thread Deepak Sharma
I am doing a join between one DataFrame and an empty DataFrame. The first DataFrame has almost 50k records. This operation never returns and runs indefinitely. Is there any solution to get around this? -- Thanks Deepak www.bigdatabig.com www.keosha.net
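
[Editor's note] Not a confirmed fix, but a minimal sketch of two things worth trying; bigDf, smallDf and the join column "key" are placeholder names: check whether the second frame really is empty, and hint a broadcast join so the ~50k-row side is not shuffled:

    import org.apache.spark.sql.functions.broadcast

    val reallyEmpty = smallDf.head(1).isEmpty                      // cheap emptiness check
    val joined      = bigDf.join(broadcast(smallDf), Seq("key"))   // broadcast the tiny side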

Help testing the Spark Extensions for the Apache Bahir 2.0.0 release

2016-08-06 Thread Luciano Resende
Apache Bahir is voting on its 2.0.0 release, based on Apache Spark 2.0.0. https://www.mail-archive.com/dev@bahir.apache.org/msg00312.html We appreciate any help reviewing/testing the release, which contains the following Apache Spark extensions: Akka DStream connector MQTT DStream connector

RE: Explanation regarding Spark Streaming

2016-08-06 Thread Mohammed Guller
Hi Jacek, Yes, I am assuming that data streams in consistently at the same rate (for example, 100MB/s). BTW, even if the persistence level for streaming data is set to MEMORY_AND_DISK_SER_2 (the default), once Spark runs out of memory, data will spill to disk. That will make the application

Re: Symbol HasInputCol is inaccessible from this place

2016-08-06 Thread Ted Yu
Is it because HasInputCol is private ? private[ml] trait HasInputCol extends Params { On Thu, Aug 4, 2016 at 1:18 PM, janardhan shetty wrote: > Version : 2.0.0-preview > > import org.apache.spark.ml.param._ > import org.apache.spark.ml.param.shared.{HasInputCol,

Re: Symbol HasInputCol is inaccessible from this place

2016-08-06 Thread janardhan shetty
Any thoughts or suggestions on this error? On Thu, Aug 4, 2016 at 1:18 PM, janardhan shetty wrote: > Version : 2.0.0-preview > > import org.apache.spark.ml.param._ > import org.apache.spark.ml.param.shared.{HasInputCol, HasOutputCol} > > > class

Re: Kmeans dataset initialization

2016-08-06 Thread Tony Lane
Can anyone suggest how I can initialize a KMeans structure directly from a Dataset of Row? On Sat, Aug 6, 2016 at 1:03 AM, Tony Lane wrote: > I have all the data required for KMeans in a dataset in memory > > Standard approach to load this data from a file is >
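
[Editor's note] A minimal sketch, assuming the Rows carry numeric columns that can be assembled into a vector column; ml KMeans can then be fitted on the in-memory Dataset directly without going through a file. The column names, k, and rowDataset are placeholders:

    import org.apache.spark.ml.clustering.KMeans
    import org.apache.spark.ml.feature.VectorAssembler

    val assembler = new VectorAssembler()
      .setInputCols(Array("x", "y", "z"))
      .setOutputCol("features")

    val featurized = assembler.transform(rowDataset)   // rowDataset: Dataset[Row] already in memory
    val model      = new KMeans().setK(5).setFeaturesCol("features").fit(featurized)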

Dropping late data in Structured Streaming

2016-08-06 Thread Amit Sela
I've noticed that when using Structured Streaming with event-time windows (fixed/sliding), all windows are retained. This is clearly how "late" data is handled, but I was wondering if there is some pruning mechanism that I might have missed? Or is this planned for future releases? Thanks, Amit

Re: Explanation regarding Spark Streaming

2016-08-06 Thread Mich Talebzadeh
The thread owner's question is: "Q1. What will happen if a Spark Streaming job has a batchDurationTime of 60 sec and the processing time of the complete pipeline is greater than 60 sec?" This basically means that you will gradually build a backlog and, regardless of whether you are going to blow up the
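
[Editor's note] Not from the thread, but relevant to the backlog scenario described above: Spark Streaming ships a backpressure mode that throttles the receiving rate when batches start lagging, which is one way to keep the queue bounded:

    // Enable rate-based backpressure (available since Spark 1.5).
    val conf = new org.apache.spark.SparkConf()
      .set("spark.streaming.backpressure.enabled", "true")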

Re: submitting spark job with kerberized Hadoop issue

2016-08-06 Thread Wojciech Pituła
What I can say is that we successfully use Spark on YARN with a kerberized cluster. One of my coworkers also tried using it in the same way as you are (Spark standalone with a kerberized cluster) but, as far as I remember, he didn't succeed. I may be wrong, because I was not personally involved in this

Re: submitting spark job with kerberized Hadoop issue

2016-08-06 Thread Jacek Laskowski
Hi Aneela, I don't really know. I've never used (or even toyed with) Spark Standalone to access a secured HDFS cluster. However, I think the settings won't work since they are for Spark on YARN (I would not be surprised to learn that it is not supported outside Spark on YARN). Regards,

Re: submitting spark job with kerberized Hadoop issue

2016-08-06 Thread Aneela Saleem
Hi Jacek! Thanks for your response. I am using Spark standalone. I have a secured Hadoop cluster. Can you please guide me on what to do if I want to access Hadoop in my Spark job? Thanks On Sat, Aug 6, 2016 at 12:34 AM, Jacek Laskowski wrote: > Just to make things clear...are you
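
[Editor's note] Not a confirmed recipe for standalone mode (the rest of the thread suggests it may simply not be supported), but one approach sometimes tried is an explicit keytab login before touching HDFS. The principal and keytab path below are placeholders, and the keytab must exist on every node running this code:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.security.UserGroupInformation

    val hadoopConf = new Configuration()
    hadoopConf.set("hadoop.security.authentication", "kerberos")
    UserGroupInformation.setConfiguration(hadoopConf)
    UserGroupInformation.loginUserFromKeytab("user@EXAMPLE.COM",
      "/etc/security/keytabs/user.keytab")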

Re: Avoid Cartesian product in calculating a distance matrix?

2016-08-06 Thread Sonal Goyal
The general approach to the Cartesian problem is to first block or index your rows so that similar items fall in the same bucket, and then join within each bucket. Is that possible in your case? On Friday, August 5, 2016, Paschalis Veskos wrote: > Hello everyone, > > I am
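
[Editor's note] A minimal sketch of the blocking idea, assuming a single numeric column "x" can serve as the blocking key; the bucket width and the distance expression are placeholders for whatever similarity measure applies:

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions._

    def pairsWithinBlocks(df: DataFrame, bucketWidth: Double): DataFrame = {
      val blocked = df.withColumn("block", floor(col("x") / bucketWidth))
      // only rows that land in the same block are compared, avoiding the full Cartesian product
      blocked.as("a").join(blocked.as("b"), col("a.block") === col("b.block"))
        .withColumn("dist", abs(col("a.x") - col("b.x")))
    }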

Re: Avoid Cartesian product in calculating a distance matrix?

2016-08-06 Thread Yann-Aël Le Borgne
Hi, I also experienced very slow computation times for the Cartesian product, and could not find an efficient way to do this apart from doing my own implementation. I used the 'balanced' cluster algorithm described here: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4246436/ I'd be interested to

RE: Explanation regarding Spark Streaming

2016-08-06 Thread Jacek Laskowski
Hi, Thanks for the explanation, but it does not prove Spark will OOM at some point. You assume there is enough data to store, but there could be none. Jacek On 6 Aug 2016 4:23 a.m., "Mohammed Guller" wrote: > Assume the batch interval is 10 seconds and batch processing time is 30 >

mapWithState handle timeout

2016-08-06 Thread 李剑
I got an error: "Cannot update the state that is timing out", because I set the timeout: val newStateDstream = newActionDstream.mapWithState(StateSpec.function(mappingFunc).timeout(Seconds(3600)).initialState(initialRDD)) In the Spark code:
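
[Editor's note] A minimal sketch of the usual fix, assuming an Int-valued state keyed by String (types and names are placeholders): guard the update with isTimingOut(), because a state that is being timed out can no longer be written:

    import org.apache.spark.streaming.State

    def mappingFunc(key: String, value: Option[Int], state: State[Int]): Option[(String, Int)] = {
      if (state.isTimingOut()) {
        // state is being removed by the timeout; do not call update() here
        Some((key, state.getOption().getOrElse(0)))
      } else {
        val sum = value.getOrElse(0) + state.getOption().getOrElse(0)
        state.update(sum)
        Some((key, sum))
      }
    }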

Re: mapWithState handle timeout

2016-08-06 Thread jackerli
any ideas? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/mapWithState-handle-timeout-tp27422p27489.html Sent from the Apache Spark User List mailing list archive at Nabble.com.