MLlib: TF-IDF Computation Improvement

2016-12-14 Thread Reth RM
Hi, Is my understanding correct that, right now, TF-IDF is computed in 3 steps? 1) Apply HashingTF on records and generate TF vectors. 2) Then an IDF model is created with the TF vectors as input - which calculates the DF (document frequency) of each term. 3) Finally the TF vectors are transformed to
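
[For reference, those three steps in the ml API look roughly like this - a minimal sketch, assuming Spark 2.x Scala; wordsDF, the column names and numFeatures are illustrative:]

    import org.apache.spark.ml.feature.{HashingTF, IDF}

    // Step 1: term frequencies via hashing (no global dictionary needed)
    val hashingTF = new HashingTF()
      .setInputCol("words").setOutputCol("rawFeatures").setNumFeatures(1 << 18)
    val tf = hashingTF.transform(wordsDF)  // wordsDF: DataFrame with a "words" array column

    // Step 2: fit IDF over the TF vectors (this is the pass that computes the DFs)
    val idfModel = new IDF().setInputCol("rawFeatures").setOutputCol("features").fit(tf)

    // Step 3: rescale the TF vectors by the learned IDF weights
    val tfidf = idfModel.transform(tf)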

How to get recent value in a Spark DataFrame

2016-12-14 Thread Milin korath
Hi, I have a Spark data frame with the following structure:

    id  flag  price  date
    a   0     100    2015
    a   0     50     2015
    a   1     200    2014
    a   1     300    2013
    a   0     400    2012

I need to create a data frame with the most recent flag-1 value filled into the flag 0 rows. id flag price
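
[One possible approach sketch, not from the thread: a window function that carries the last seen flag=1 price forward within each id, assuming df holds the table above and "recent" means the latest preceding flag=1 row by date:]

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions._

    // Last non-null flag=1 price over all rows up to the current one, per id
    val w = Window.partitionBy("id").orderBy("date").rowsBetween(Long.MinValue, 0)
    val result = df.withColumn("recent_price",
      last(when(col("flag") === 1, col("price")), ignoreNulls = true).over(w))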

Re: Belief propagation algorithm is open sourced

2016-12-14 Thread Bryan Cutler
I'll check it out, thanks for sharing Alexander! On Dec 13, 2016 4:58 PM, "Ulanov, Alexander" wrote: > Dear Spark developers and users, > > > HPE has open sourced the implementation of the belief propagation (BP) > algorithm for Apache Spark, a popular message passing

Re: How many Spark streaming applications can be run at a time on a Spark cluster?

2016-12-14 Thread Akhilesh Pathodia
If you have enough cores/resources, run them separately depending on your use case. On Thursday 15 December 2016, Divya Gehlot wrote: > It depends on the use case ... > Spark always depends on resource availability. > As long as you have the resources to accommodate them, you can

Re: spark reshape hive table and save to parquet

2016-12-14 Thread Divya Gehlot
You can use UDFs to do it: http://stackoverflow.com/questions/31615657/how-to-add-a-new-struct-column-to-a-dataframe Hope it will help. Thanks, Divya On 9 December 2016 at 00:53, Anton Kravchenko wrote: > Hello, > > I wonder if there is a way (preferably
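
[A sketch in the spirit of the linked answer; column names and the Pair case class are illustrative. The built-in struct() often avoids a UDF entirely:]

    import org.apache.spark.sql.functions._

    // Built-in: pack existing columns into a new struct column, no UDF needed
    val withStruct = df.withColumn("pair", struct(col("col1"), col("col2")))

    // UDF variant: a case class return type becomes a struct column
    case class Pair(a: Int, b: Int)
    val mkPair = udf((a: Int, b: Int) => Pair(a, b))
    val withUdfStruct = df.withColumn("pair", mkPair(col("col1"), col("col2")))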

Re: How many Spark streaming applications can be run at a time on a Spark cluster?

2016-12-14 Thread Divya Gehlot
It depends on the use case ... Spark always depends on resource availability. As long as you have the resources to accommodate them, you can run as many Spark/Spark Streaming applications as you like. Thanks, Divya On 15 December 2016 at 08:42, shyla deshpande wrote: > How many Spark

How many Spark streaming applications can be run at a time on a Spark cluster?

2016-12-14 Thread shyla deshpande
How many Spark streaming applications can be run at a time on a Spark cluster? Is it better to have one Spark Streaming application consume all the Kafka topics, or to have multiple streaming applications where possible to keep things simple? Thanks

Re: spark reshape hive table and save to parquet

2016-12-14 Thread Anton Kravchenko
I am looking for something like:

    // prepare input data
    val input_schema = StructType(Seq(
      StructField("col1", IntegerType),
      StructField("col2", IntegerType),
      StructField("col3", IntegerType)))
    val input_data = spark.createDataFrame(
      sc.parallelize(Seq(
        Row(1, 2, 3),

TaskSetManager stalls for 1 min in the middle of a job

2016-12-14 Thread Oleg Mazurov
Having submitted three tasks at level PROCESS_LOCAL, TaskSetManager moves to the next locality level and gets stuck there for 60 sec. That level is not empty, but it appears to contain the same tasks that were already submitted and successfully executed, which leads to a stall until the corresponding timeout
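
[The usual knob for this behaviour is the locality wait timeout; a hypothetical tuning sketch - the 60 s stall suggests spark.locality.wait was raised from its 3 s default:]

    import org.apache.spark.SparkConf

    // Let the scheduler fall through locality levels instead of stalling
    val conf = new SparkConf()
      .set("spark.locality.wait", "3s")          // global per-level wait (default 3s)
      .set("spark.locality.wait.process", "0s")  // give up on PROCESS_LOCAL immediately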

Re: coalesce ending up very unbalanced - but why?

2016-12-14 Thread Vaibhav Sinha
Hi, I see similar behaviour in an exactly similar scenario at my deployment as well. I am using Scala, so the behaviour is not limited to PySpark. In my observation, 9 out of 10 partitions (in my case) are of similar size, ~38 GB each, and the final one is significantly larger, ~59 GB. Prime number

Handling Exception or Control in spark dataframe write()

2016-12-14 Thread bhayat
Hello, I am writing my RDD into parquet format, but as I understand it the write() method is still experimental and I do not know how I will deal with possible exceptions. For example: schemaXXX.write().mode(saveMode).parquet(parquetPathInHdfs); In this example I do not know how I will handle
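
[A minimal defensive sketch, assuming schemaXXX is a DataFrame and that the write throws on failure; saveMode and parquetPathInHdfs as in the message above:]

    import scala.util.{Failure, Success, Try}

    Try(schemaXXX.write.mode(saveMode).parquet(parquetPathInHdfs)) match {
      case Success(_) =>
        println(s"wrote parquet to $parquetPathInHdfs")
      case Failure(e) =>
        // inspect/log e here, then retry, write to a fallback path, or rethrow
        throw e
    }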

Hortonworks Spark Certification Query

2016-12-14 Thread Aakash Basu
Hi all, Is there anyone who wrote the HDPCD examination as in the below link? http://hortonworks.com/training/certification/exam-objectives/#hdpcdspark I'm going to sit for this with very little time to prepare; can I please be helped with the questions to expect and their probable solutions?

Re: coalesce ending up very unbalanced - but why?

2016-12-14 Thread Dirceu Semighini Filho
Hello, We have done some tests here, and it seems that when we use a prime number of partitions the data is more spread out. This has to do with the hash partitioning and the Java hash algorithm. I don't know what your data looks like or how this works in Python, but if you (can) implement a partitioner, or
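
[A minimal custom Partitioner sketch for the pair-RDD case - illustrative only, the DataFrame side would need a different hook:]

    import org.apache.spark.Partitioner

    // Same shape as HashPartitioner, but you control the key -> partition
    // mapping, e.g. to mix or bucket skewed keys explicitly
    class SaltedPartitioner(parts: Int) extends Partitioner {
      def numPartitions: Int = parts
      def getPartition(key: Any): Int = {
        val h = key.hashCode * 31 + 17   // cheap extra mixing
        ((h % parts) + parts) % parts    // non-negative modulo
      }
    }

    // usage: pairRdd.partitionBy(new SaltedPartitioner(71))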

Proper use of map(..., Encoder)

2016-12-14 Thread Brad Cox
Here's a fragment of code that intends to convert a Dataset of features into a Vector of Doubles for use as the features column for SparkML's DecisionTree algorithm. My current problem is the .map() operation, which refuses to compile with an Eclipse error "The method map(Function1,
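
[Not the poster's code, but a common way to sidestep the map/Encoder issue entirely is VectorAssembler - a sketch, with illustrative column names:]

    import org.apache.spark.ml.feature.VectorAssembler

    // Builds the "features" Vector column DecisionTree expects,
    // without a hand-written map() or an explicit Encoder
    val assembler = new VectorAssembler()
      .setInputCols(Array("f1", "f2", "f3"))
      .setOutputCol("features")
    val withFeatures = assembler.transform(df)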

Re: Graphx triplet comparison

2016-12-14 Thread Robineast
You are trying to invoke one RDD action inside another; that won't work. If you want to do what you are attempting, you need to .collect() the triplets to the driver and iterate over them there. HOWEVER, you almost certainly don't want to do that, not if your data are anything other than a trivial size. In
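
[A sketch of the collect-to-driver pattern described above - again, only viable for trivially small graphs:]

    // Materialize the triplets on the driver, then compare pairs locally
    val ts = graph.triplets.collect()
    for (a <- ts; b <- ts if a != b) {
      // compare a.srcAttr / a.attr / a.dstAttr against b here
    }

    // For non-trivial sizes, a distributed alternative is a cartesian self-join:
    // graph.triplets.cartesian(graph.triplets)
    //   .filter { case (a, b) => (a.srcId, a.dstId) != (b.srcId, b.dstId) }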

Re: coalesce ending up very unbalanced - but why?

2016-12-14 Thread Adrian Bridgett
Since it's PySpark it's just using the default hash partitioning, I believe. Trying a prime number (71, so that there are enough CPUs) doesn't seem to change anything. Out of curiosity, why did you suggest that? Googling "spark coalesce prime" doesn't give me any clue :-) Adrian On 14/12/2016

Re: coalesce ending up very unbalanced - but why?

2016-12-14 Thread Dirceu Semighini Filho
Hi Adrian, Which kind of partitioning are you using? Have you already tried to coalesce it to a prime number? 2016-12-14 11:56 GMT-02:00 Adrian Bridgett : > I realise that coalesce() isn't guaranteed to be balanced and adding a > repartition() does indeed fix this (at the

coalesce ending up very unbalanced - but why?

2016-12-14 Thread Adrian Bridgett
I realise that coalesce() isn't guaranteed to be balanced, and adding a repartition() does indeed fix this (at the cost of a large shuffle). I'm trying to understand _why_ it's so uneven (hopefully it helps someone else too). This is using Spark v2.0.2 (PySpark). Essentially we're just
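
[For context, the difference in one sketch - Scala shown, but the same holds in PySpark:]

    // coalesce(n) only merges existing partitions -- no shuffle, so any skew
    // in the parent partitioning carries straight through to the output
    val merged = df.coalesce(71)

    // repartition(n) forces a full hash-partitioned shuffle and rebalances
    val balanced = df.repartition(71)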

Can we unite the UIs of different standalone clusters?

2016-12-14 Thread John Fang
As we know, each standalone cluster has its own UI, so we will have more than one UI if we have many standalone clusters. How can I have a single UI which can access the different standalone clusters?

Does Spark 2.0.2 sql syntax not support hive sql syntax any more?

2016-12-14 Thread qmzhang
I upgraded my Spark cluster from 1.6.2 to 2.0.2 and tested the Spark 2 SQL syntax. I found some syntax that Spark 2.0.2 does not support but that works in Spark 1.6.2. The Hive metastore version is 1.2.1. For example: 1. ALTER TABLE table_name ADD COLUMNS (m_id STRING); Spark 2.0.2 throws an exception: Operation not
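
[A reproduction sketch of the first case; Spark 2.0's native SQL parser did not handle this DDL, and one workaround is issuing it from Hive itself:]

    // Works in Spark 1.6.2 (HiveContext), fails in Spark 2.0.2:
    spark.sql("ALTER TABLE table_name ADD COLUMNS (m_id STRING)")
    // throws "Operation not allowed" in 2.0.2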

Unsubscribe

2016-12-14 Thread Mostafa Alaa Mohamed
Unsubscribe Best Regards, Mostafa Alaa Mohamed, Technical Expert Big Data, M: +971506450787 Email: mohamedamost...@etisalat.ae