Need help with a Java version of this Scala script

2016-12-16 Thread Richard Xin
What I am trying to do: I need to add a column (possibly a complicated transformation based on the value of an existing column) to a given DataFrame. Scala script:

    val hContext = new HiveContext(sc)
    import hContext.implicits._
    val df = hContext.sql("select x,y,cluster_no from test.dc")
    val len = udf((str: String) =>
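The Scala snippet above cuts off inside the UDF body; assuming it computes a string length, a minimal Java sketch of the same flow on Spark 1.x might look like this (the output column name x_len is illustrative):

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.sql.DataFrame;
    import org.apache.spark.sql.api.java.UDF1;
    import org.apache.spark.sql.hive.HiveContext;
    import org.apache.spark.sql.types.DataTypes;
    import static org.apache.spark.sql.functions.callUDF;
    import static org.apache.spark.sql.functions.col;

    public class AddColumnExample {
        public static void main(String[] args) {
            JavaSparkContext jsc = new JavaSparkContext(new SparkConf().setAppName("AddColumnExample"));
            HiveContext hContext = new HiveContext(jsc.sc());
            DataFrame df = hContext.sql("select x,y,cluster_no from test.dc");

            // Java lambdas have no schema inference, so the return type is given explicitly.
            hContext.udf().register("len",
                (UDF1<String, Integer>) s -> s == null ? null : s.length(),
                DataTypes.IntegerType);

            // Equivalent of df.withColumn("x_len", len($"x")) in Scala.
            df.withColumn("x_len", callUDF("len", col("x"))).show();
        }
    }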

Running Multiple Versions of Spark on the same cluster (YARN)

2016-12-16 Thread Jorge Machado
Hi everyone, I have one question: is it possible, on HDP with Spark 1.6.1, to also run Spark 2.0.0 on top of it? For example by passing the Spark libs with --jars? The idea behind it is to avoid depending on the default HDP installation and be able to deploy new versions of Spark quickly.
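For what it's worth, YARN itself is version-agnostic: each Spark version is just a client-side package plus its jars. A hedged sketch, with hypothetical paths and assuming Spark 2.x where spark.yarn.jars exists:

    # Install the new Spark on an edge node and stage its jars on HDFS.
    export SPARK_HOME=/opt/spark-2.0.0
    hdfs dfs -mkdir -p /apps/spark-2.0.0-jars
    hdfs dfs -put $SPARK_HOME/jars/* /apps/spark-2.0.0-jars/

    # conf/spark-defaults.conf of the Spark 2.0.0 client:
    spark.yarn.jars hdfs:///apps/spark-2.0.0-jars/*.jar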

Regarding Connection Problem

2016-12-16 Thread Chintan Bhatt
Hi, I want to take continuous output (average temperature) generated from node.js, store it on Hadoop, and then retrieve it for visualization. Please guide me on how to send the continuous output of node.js to Kafka. -- CHINTAN BHATT
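The produce-to-Kafka step the poster asks about is the same in any client; a minimal sketch using the standard Java client (broker address, topic, and values are hypothetical; a node.js client such as kafka-node follows the same pattern):

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class TemperatureProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // Each reading becomes one message keyed by sensor id.
                producer.send(new ProducerRecord<>("temperature", "sensor-1", "21.4"));
            }
        }
    }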

Spark GraphFrames generic question

2016-12-16 Thread Ankur Srivastava
Hi, I am working on two different use cases where the basic problem is the same but the scale is very different. In case 1 we have two entities that can have a many-to-many relation, and we want to identify all subgraphs in the full graph and then further prune the graph to find the best relation.
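Identifying all subgraphs is what connected components computes; a hedged sketch with the GraphFrames package, assuming its Scala API called from Java (tiny inline data for illustration):

    import java.util.Arrays;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;
    import org.graphframes.GraphFrame;

    public class SubgraphSketch {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder().appName("subgraphs").getOrCreate();
            // Recent GraphFrames versions require a checkpoint dir for connectedComponents.
            spark.sparkContext().setCheckpointDir("/tmp/gf-checkpoints");

            // Vertices need an "id" column; edges need "src" and "dst".
            Dataset<Row> vertices = spark.createDataFrame(
                Arrays.asList(new V("a"), new V("b"), new V("c")), V.class);
            Dataset<Row> edges = spark.createDataFrame(
                Arrays.asList(new E("a", "b")), E.class);

            GraphFrame g = GraphFrame.apply(vertices, edges);
            // Each distinct "component" value identifies one subgraph.
            g.connectedComponents().run().show();
        }

        public static class V implements java.io.Serializable {
            private String id;
            public V(String id) { this.id = id; }
            public String getId() { return id; }
            public void setId(String id) { this.id = id; }
        }

        public static class E implements java.io.Serializable {
            private String src, dst;
            public E(String src, String dst) { this.src = src; this.dst = dst; }
            public String getSrc() { return src; }
            public void setSrc(String s) { src = s; }
            public String getDst() { return dst; }
            public void setDst(String d) { dst = d; }
        }
    }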

Re: Do we really need Mesos or YARN? Or is standalone sufficient?

2016-12-16 Thread vaquar khan
Hi Kant, hope the following information helps.
1) Cluster: https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-standalone.html and http://spark.apache.org/docs/latest/hardware-provisioning.html
2) YARN vs Mesos: https://www.linkedin.com/pulse/mesos-compare-yarn-vaquar-

Re: Do we really need Mesos or YARN? Or is standalone sufficient?

2016-12-16 Thread kant kodali
Hi Saif, what do you mean by a small cluster? Any specific size? Also, can you shed some light on how YARN takes the win over Mesos? Thanks, kant On Fri, Dec 16, 2016 at 10:45 AM, wrote: > In my experience, Standalone works very well in a small cluster where there >

Re: Issue: Skew on Dataframes while Joining the dataset

2016-12-16 Thread vaquar khan
For that kind of issue, the Spark UI and DAG visualization are always helpful. https://databricks.com/blog/2015/06/22/understanding-your-spark-application-through-visualization.html Regards, Vaquar khan On Fri, Dec 16, 2016 at 11:10 AM, Vikas K. wrote: > Unsubscribe. > > On Fri,

Re: How to get recent value in spark dataframe

2016-12-16 Thread Michael Armbrust
Oh and to get the null for missing years, you'd need to do an outer join with a table containing all of the years you are interested in. On Fri, Dec 16, 2016 at 3:24 PM, Michael Armbrust wrote: > Are you looking for argmax? Here is an example >
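A hedged sketch of that outer join in Java (table and column names are hypothetical):

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class FillMissingYears {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder().appName("fill-years").getOrCreate();

            // Hypothetical table of observed values, keyed by year.
            Dataset<Row> observed = spark.sql("select year, price from some_table");

            // One row per year of interest; range() upper bound is exclusive.
            Dataset<Row> years = spark.range(2010, 2017).withColumnRenamed("id", "year");

            // Outer join: years with no observation come back with a null price.
            Dataset<Row> filled = years.join(
                observed, years.col("year").equalTo(observed.col("year")), "left_outer");
            filled.show();
        }
    }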

Re: How to get recent value in spark dataframe

2016-12-16 Thread Michael Armbrust
Are you looking for argmax? Here is an example. On Wed, Dec 14, 2016 at 8:49 PM, Milin korath wrote: > Hi
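The example link did not survive the archive. A common way to express argmax in Spark SQL, shown here as a sketch with hypothetical table and column names, is max over a struct, which compares fields left to right:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;
    import static org.apache.spark.sql.functions.*;

    public class ArgmaxSketch {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder().appName("argmax").getOrCreate();
            Dataset<Row> df = spark.sql("select id, time, price from quotes"); // hypothetical table

            // Putting the timestamp first in the struct makes max() pick the
            // row with the latest time per id, carrying its price along.
            Dataset<Row> latest = df.groupBy(col("id"))
                .agg(max(struct(col("time"), col("price"))).alias("latest"))
                .select(col("id"), col("latest.time"), col("latest.price"));
            latest.show();
        }
    }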

Re: Do we really need Mesos or YARN? Or is standalone sufficient?

2016-12-16 Thread Anant Chintamaneni
+1 Sent from my iPhone > On Dec 16, 2016, at 10:45 AM, > wrote: > > In my experience, Standalone works very well in a small cluster where there > isn’t anything else running. > > For a bigger cluster or shared resources,

RE: Do we really need Mesos or YARN? Or is standalone sufficient?

2016-12-16 Thread Saif.A.Ellafi
In my experience, Standalone works very well in a small cluster where there isn’t anything else running. For a bigger cluster or shared resources, YARN takes the win; the overhead of spawning containers (as opposed to an always-running background worker) is worth it. Best is to try both: if standalone is good

Re: How to dynamically register a UDF via reflection?

2016-12-16 Thread Cheng Lian
Could you please provide more context about what you are trying to do here? On Thu, Dec 15, 2016 at 6:27 PM 李斌松 wrote: > How can I dynamically register a UDF via reflection? > > java.lang.UnsupportedOperationException: Schema for type _$13 is not > supported > at >
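For context on the error: Spark derives a Catalyst schema for the UDF's types by reflection, and an existential type parameter like _$13 cannot be mapped. A hedged sketch of the usual way around this, registering with an explicit return DataType so no inference is needed:

    import org.apache.spark.sql.SparkSession;
    import org.apache.spark.sql.api.java.UDF1;
    import org.apache.spark.sql.types.DataTypes;

    public class RegisterUdfSketch {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder().appName("udf-reflect").getOrCreate();

            // Passing the return DataType explicitly sidesteps schema inference,
            // which is what fails when the type is only known generically at runtime.
            spark.udf().register("strLen",
                (UDF1<String, Integer>) s -> s == null ? null : s.length(),
                DataTypes.IntegerType);

            spark.sql("select strLen('hello')").show(); // prints 5
        }
    }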

Re: Issue: Skew on Dataframes while Joining the dataset

2016-12-16 Thread Vikas K.
Unsubscribe. On Fri, Dec 16, 2016 at 9:21 PM, KhajaAsmath Mohammed < mdkhajaasm...@gmail.com> wrote: > Hi, > > I am facing an issue with a join operation on a dataframe. My job runs > for a very long time (> 2 hrs) without any result. Can someone help me > resolve this? > > I tried

Unsubscribe

2016-12-16 Thread krishna ramachandran
Unsubscribe

Mesos Spark Fine Grained Execution - CPU count

2016-12-16 Thread Chawla,Sumit
Hi, I am using Spark 1.6 and have one query about the fine-grained model in Spark. I have a simple Spark application that transforms A -> B; it's a single-stage application. The job starts with 48 partitions, and when it runs, the Mesos UI shows 48 tasks and 48 CPUs

Re: Issue: Skew on Dataframes while Joining the dataset

2016-12-16 Thread KhajaAsmath Mohammed
Hi, I am able to resolve this issue. The culprit was the SQL query; adding one more join returned records in less time. Thanks, Asmath On Fri, Dec 16, 2016 at 9:51 AM, KhajaAsmath Mohammed < mdkhajaasm...@gmail.com> wrote: > Hi, > > I am facing an issue with a join operation on a dataframe. My job is

Re: Spark Batch checkpoint

2016-12-16 Thread Chawla,Sumit
Sorry for hijacking this thread. @irving, how do you restart a Spark job from a checkpoint? Regards Sumit Chawla On Fri, Dec 16, 2016 at 2:24 AM, Selvam Raman wrote: > Hi, > > Actually my requirement is to read the parquet file which has 100 partitions. > Then I use
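For streaming jobs the restart mechanism is JavaStreamingContext.getOrCreate; note that for plain batch jobs, RDD/Dataset checkpointing only truncates lineage and there is no built-in restart. A minimal sketch (checkpoint path and socket source are placeholders):

    import org.apache.spark.SparkConf;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;

    public class RestartFromCheckpoint {
        private static final String CHECKPOINT_DIR = "hdfs:///tmp/app-checkpoint"; // hypothetical

        public static void main(String[] args) throws InterruptedException {
            // On a clean start this factory runs; on restart, the context
            // (DStream graph plus progress) is rebuilt from the checkpoint instead.
            JavaStreamingContext jssc = JavaStreamingContext.getOrCreate(CHECKPOINT_DIR, () -> {
                SparkConf conf = new SparkConf().setAppName("checkpointed-stream");
                JavaStreamingContext ctx = new JavaStreamingContext(conf, Durations.seconds(10));
                ctx.checkpoint(CHECKPOINT_DIR);
                // Stand-in pipeline; replace with the real DStream definition.
                ctx.socketTextStream("localhost", 9999).print();
                return ctx;
            });
            jssc.start();
            jssc.awaitTermination();
        }
    }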

Issue: Skew on Dataframes while Joining the dataset

2016-12-16 Thread KhajaAsmath Mohammed
Hi, I am facing an issue with a join operation on a dataframe. My job runs for a very long time (> 2 hrs) without any result. Can someone help me resolve this? I tried repartitioning to 13 partitions, but no luck. val results_dataframe = sqlContext.sql("select gt.*,ct.* from PredictTempTable
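One hedged mitigation sketch for a skewed join: if one side is small, broadcast it so the skewed shuffle never happens. The second table name and join key below are guesses for illustration; only PredictTempTable appears in the original post:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;
    import static org.apache.spark.sql.functions.broadcast;

    public class SkewedJoinSketch {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder().appName("skew-join").getOrCreate();
            Dataset<Row> predict = spark.table("PredictTempTable");   // large side
            Dataset<Row> centroid = spark.table("CentroidTempTable"); // assumed small side

            // Broadcasting the small side avoids the shuffle whose skewed
            // partitions keep a handful of tasks running for hours.
            Dataset<Row> results = predict.join(
                broadcast(centroid),
                predict.col("cluster_no").equalTo(centroid.col("cluster_no")));
            results.show();
        }
    }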

Re: Negative values of predictions in ALS.tranform

2016-12-16 Thread Manish Tripathi
Thanks a bunch. That's very helpful. On Friday, December 16, 2016, Sean Owen wrote: > That all looks correct. > > On Thu, Dec 15, 2016 at 11:54 PM Manish Tripathi > wrote: > >> ok. Thanks. So here is

Re: Why does the foreachPartition function make duplicate invocations of the map function for every message? (Spark 2.0.2)

2016-12-16 Thread Cody Koeninger
Please post a minimal complete code example of what you are talking about On Thu, Dec 15, 2016 at 6:00 PM, Michael Nguyen wrote: > I have the following sequence of Spark Java API calls (Spark 2.0.2): > > Kafka stream that is processed via a map function, which returns

Re: coalesce ending up very unbalanced - but why?

2016-12-16 Thread vaquar khan
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-rdd-partitions.html Regards, vaquar khan On Wed, Dec 14, 2016 at 12:15 PM, Vaibhav Sinha wrote: > Hi, > I see a similar behaviour in an exactly similar scenario at my deployment > as well. I am using
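The short version of why coalesce can end up unbalanced: it merges existing partitions without a shuffle, so upstream skew survives, whereas repartition shuffles and rebalances. A small sketch:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class CoalesceVsRepartition {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder().appName("partitions").getOrCreate();
            Dataset<Row> df = spark.range(1_000_000).toDF("n");

            // coalesce(n) only merges existing partitions (cheap, can stay
            // unbalanced); repartition(n) does a full shuffle and evens out
            // the data at the cost of moving it.
            System.out.println(df.coalesce(10).rdd().getNumPartitions());
            System.out.println(df.repartition(10).rdd().getNumPartitions());
        }
    }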

java.lang.RuntimeException: Stream '/jars/' not found

2016-12-16 Thread Hanumath Rao Maduri
Hello all, I am trying to test an application on a standalone cluster. Here is my scenario: I started a Spark master on node A and also one worker on the same node A. I am trying to run the application from node B (which, I believe, acts as the driver). I have added jars to the SparkConf using
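A hedged sketch of registering the application jar on the conf so the standalone master can serve it to executors (master URL and jar path are hypothetical):

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;

    public class DriverWithJars {
        public static void main(String[] args) {
            // The jar path must be readable by the driver on node B; Spark
            // then serves it to the executors over its file server.
            SparkConf conf = new SparkConf()
                .setAppName("node-b-driver")
                .setMaster("spark://nodeA:7077")
                .setJars(new String[] {"/path/to/app.jar"});
            JavaSparkContext sc = new JavaSparkContext(conf);
            System.out.println(sc.jars()); // should list the jar registered above
            sc.stop();
        }
    }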

unsubscribe

2016-12-16 Thread Javier Rey

Re: How to get recent value in spark dataframe

2016-12-16 Thread vaquar khan
Not sure about your 0 and 1 flag logic, but you can orderBy the data according to time and take the first value. Regards, Vaquar khan On Wed, Dec 14, 2016 at 10:49 PM, Milin korath wrote: > Hi > > I have a spark data frame with following structure > > id flag price
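The same idea as a window function, sketched with the id/time/price columns from the quoted schema (the source table is hypothetical):

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;
    import org.apache.spark.sql.expressions.Window;
    import org.apache.spark.sql.expressions.WindowSpec;
    import static org.apache.spark.sql.functions.*;

    public class MostRecentPerId {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder().appName("recent-value").getOrCreate();
            Dataset<Row> df = spark.sql("select id, time, price from quotes"); // hypothetical

            // Rank rows within each id by descending time and keep the newest.
            WindowSpec w = Window.partitionBy("id").orderBy(col("time").desc());
            Dataset<Row> newest = df.withColumn("rn", row_number().over(w))
                                    .filter(col("rn").equalTo(1))
                                    .drop("rn");
            newest.show();
        }
    }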

Re: Spark dump in slave Node EMR

2016-12-16 Thread Selvam Raman
If I want to target specifically the task that failed, is it possible to take a heap dump? "16/12/16 12:25:54 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Container killed by YARN for exceeding memory limits. 20.0 GB of 19.8 GB physical memory used. Consider boosting

Spark dump in slave Node EMR

2016-12-16 Thread Selvam Raman
Hi, how can I take a heap dump on an EMR slave node to analyze? I have one master and two slaves. If I enter the jps command on the master, I can see SparkSubmit with its pid, but I cannot see anything on the slave nodes. How can I take a heap dump for a Spark job? -- Selvam Raman "லஞ்சம் தவிர்த்து நெஞ்சம்
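A hedged sketch of the usual approach: on a worker the executor JVM is not named SparkSubmit, which is why jps on the slave looks empty at first glance. The pid below is illustrative:

    # List JVMs with their main classes; executors on YARN show up as
    # CoarseGrainedExecutorBackend.
    jps -l
    # e.g. 12345 org.apache.spark.executor.CoarseGrainedExecutorBackend

    # Dump that JVM's heap (run as the same user the executor runs as,
    # typically yarn or hadoop on EMR), then analyze the .hprof offline.
    jmap -dump:live,format=b,file=/tmp/executor-heap.hprof 12345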

Re: Gradle dependency problem with spark

2016-12-16 Thread Steve Loughran
FWIW, although the underlying Hadoop declared Guava dependency is pretty low, everything in org.apache.hadoop is set up to run against later versions. It just sticks with the old one to avoid breaking anything downstream which does expect a low version number. See HADOOP-10101 for the ongoing

Re: Handling Exception or Control in spark dataframe write()

2016-12-16 Thread Steve Loughran
> On 14 Dec 2016, at 18:10, bhayat wrote: > > Hello, > > I am writing my RDD in Parquet format, but as I understand it the write() > method is still experimental, and I do not know how I will deal with possible > exceptions. > > For example: > >
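The quoted example is truncated. As a hedged sketch: write() executes eagerly, so failures surface as ordinary exceptions at the call site, and a plain try/catch is the control point (output path hypothetical):

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SaveMode;
    import org.apache.spark.sql.SparkSession;

    public class SafeParquetWrite {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder().appName("safe-write").getOrCreate();
            Dataset<Row> df = spark.range(100).toDF("n");

            // A failed write leaves no partial-result handle to inspect;
            // the exception here is all the control you get.
            try {
                df.write().mode(SaveMode.Overwrite).parquet("/tmp/out.parquet");
            } catch (Exception e) {
                // Log and decide: retry, write to a fallback path, or rethrow.
                throw new RuntimeException("Parquet write failed", e);
            }
        }
    }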

Re: Gradle dependency problem with spark

2016-12-16 Thread Sean Owen
Yes, that's the problem. Guava isn't generally mutually compatible across more than a couple of major releases. You may have to hunt for a version that happens to have the functionality that both dependencies want, and hope it exists. Spark should shade Guava at this point, but that doesn't mean that you

Re: Gradle dependency problem with spark

2016-12-16 Thread kant kodali
I replaced guava-14.0.1.jar with guava-19.0.jar in SPARK_HOME/jars and it seems to work OK, but I am not sure if it is the right thing to do. My fear is that if Spark uses features from Guava that are only present in 14.0.1 but not in 19.0, my app will break. On Fri, Dec 16, 2016 at 2:22

Re: Dataset encoders for further types?

2016-12-16 Thread Jakub Dubovsky
I will give that a try. Thanks! On Fri, Dec 16, 2016 at 12:45 AM, Michael Armbrust wrote: > I would have sworn there was a ticket, but I can't find it. So here you > go: https://issues.apache.org/jira/browse/SPARK-18891 > > A workaround until that is fixed would be for
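Michael's suggested workaround is cut off above. One stopgap often used for types without a built-in encoder, shown here as a sketch and not necessarily what he had in mind, is an opaque Kryo encoder:

    import java.util.Collections;
    import java.util.HashMap;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Encoder;
    import org.apache.spark.sql.Encoders;
    import org.apache.spark.sql.SparkSession;

    public class KryoEncoderSketch {
        @SuppressWarnings({"unchecked", "rawtypes"})
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder().appName("kryo-encoder").getOrCreate();

            // Types without a built-in encoder can be carried as Kryo-serialized
            // blobs; you lose columnar storage and pushdown, but the Dataset
            // API works.
            Encoder<HashMap> enc = Encoders.kryo(HashMap.class);
            HashMap m = new HashMap();
            m.put("a", 1);
            Dataset<HashMap> ds = spark.createDataset(Collections.singletonList(m), enc);
            System.out.println(ds.count());
        }
    }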

Re: Spark Batch checkpoint

2016-12-16 Thread Selvam Raman
Hi, actually my requirement is to read a Parquet file that has 100 partitions, then use foreachPartition to read the data and process it. My sample code: public static void main(String[] args) { SparkSession sparkSession = SparkSession.builder().appName("checkpoint verification").getOrCreate();

Gradle dependency problem with spark

2016-12-16 Thread kant kodali
Hi guys, here is a simplified version of my problem (I am new to Gradle):

    dependencies {
        compile group: 'org.apache.spark', name: 'spark-core_2.11', version: '2.0.2'
        compile group: 'com.github.brainlag', name: 'nsq-client', version: '1.0.0.RC2'
    }

I took
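Following up on the shading discussion in this thread, the two usual escapes are forcing a single Guava version or shading it yourself. A hedged sketch of the force option in the same Gradle DSL (19.0 is the version tried later in the thread; verify it satisfies both dependencies):

    configurations.all {
        resolutionStrategy {
            // Pin one Guava everywhere so Gradle stops picking per-path winners.
            force 'com.google.guava:guava:19.0'
        }
    }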

Re: Negative values of predictions in ALS.tranform

2016-12-16 Thread Sean Owen
That all looks correct. On Thu, Dec 15, 2016 at 11:54 PM Manish Tripathi wrote: > ok. Thanks. So here is what I understood. > > Input data to Als.fit(implicitPrefs=True) is the actual strengths (count > data). So if I have a matrix of (user,item,views/purchases) I pass that
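For readers arriving from the subject line: with implicit feedback, ALS predicts preference/confidence scores rather than the original counts, so small negative values are expected. A hedged Java sketch (table and column names are hypothetical):

    import org.apache.spark.ml.recommendation.ALS;
    import org.apache.spark.ml.recommendation.ALSModel;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class ImplicitAlsSketch {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder().appName("implicit-als").getOrCreate();
            // Hypothetical table of (user, item, views) interaction counts.
            Dataset<Row> ratings = spark.table("interactions");

            ALS als = new ALS()
                .setImplicitPrefs(true)   // treat "rating" as a confidence-weighted count
                .setUserCol("user")
                .setItemCol("item")
                .setRatingCol("views");
            ALSModel model = als.fit(ratings);

            // With implicit feedback, transform() outputs preference scores,
            // not predicted counts; values near or below 0 mean "unlikely".
            model.transform(ratings).show();
        }
    }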

What is the deployment model for Spark Streaming? A specific example.

2016-12-16 Thread Russell Jurney
I have created a PySpark Streaming application that uses Spark ML to classify flight delays into three categories: on-time, slightly late, very late. After an hour or so something times out and the whole thing crashes. The code and error are on a gist here:

Do we really need Mesos or YARN? Or is standalone sufficient?

2016-12-16 Thread kant kodali
Do we really need Mesos or YARN, or is standalone sufficient for production systems? I understand the difference, but I don't know the capabilities of a standalone cluster. Does anyone have experience deploying standalone in production?