Re: Pretty print a dataframe...

2017-02-16 Thread Muthu Jayakumar
This worked. Thanks for the tip Michael. Thanks, Muthu On Thu, Feb 16, 2017 at 12:41 PM, Michael Armbrust wrote: > The toString method of Dataset.queryExecution includes the various plans. > I usually just log that directly. > > On Thu, Feb 16, 2017 at 8:26 AM, Muthu

Spark Worker can't find jar submitted programmatically

2017-02-16 Thread jeremycod
Hi, I'm trying to create an application that would programmatically submit a jar file to a Spark standalone cluster running on my local PC. However, I'm always getting the error WARN TaskSetManager:66 - Lost task 1.0 in stage 0.0 (TID 1, 192.168.2.68, executor 0): java.lang.RuntimeException: Stream
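
A common cause of errors like this is that standalone workers cannot fetch the application jar. A minimal sketch of pointing executors at the jar when building the session programmatically (master URL and path are placeholders):

    import org.apache.spark.sql.SparkSession

    // spark.jars lists jars the workers must download before running tasks.
    val spark = SparkSession.builder()
      .master("spark://localhost:7077")              // standalone master (assumed)
      .appName("programmatic-submit")
      .config("spark.jars", "/path/to/my-app.jar")   // must be reachable from the workers
      .getOrCreate()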

Re: Can't load a RandomForestClassificationModel in Spark job

2017-02-16 Thread Russell Jurney
When you say workers, are you using Spark Streaming? I'm not sure if this will help, but there is an example of deploying a RandomForestClassificationModel in Spark Streaming against Kafka that uses createDataFrame here:
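
For reference, a minimal sketch of reloading a saved model with the spark.ml persistence API (path and test DataFrame are placeholders):

    import org.apache.spark.ml.classification.RandomForestClassificationModel

    // A model written with model.save(path) is reloaded via the companion object.
    val model = RandomForestClassificationModel.load("hdfs:///models/rf")
    val scored = model.transform(testDf)   // testDf assumed to have the expected features column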

RE: Can't load a RandomForestClassificationModel in Spark job

2017-02-16 Thread Jianhong Xia
Thanks Hollin. I will take a look at mleap and will let you know if I have any questions. Jianhong From: Hollin Wilkins [mailto:hol...@combust.ml] Sent: Tuesday, February 14, 2017 11:48 PM To: Jianhong Xia Cc: Sumona Routh ; ayan guha

Re: Remove .HiveStaging files

2017-02-16 Thread Xiao Li
Maybe you can check this PR? https://github.com/apache/spark/pull/16399 Thanks, Xiao 2017-02-15 15:05 GMT-08:00 KhajaAsmath Mohammed : > Hi, > > I am using spark temporary tables to write data back to hive. I have seen > weird behavior of .hive-staging files after

Spark standalone cluster on EC2 error .. Checkpoint..

2017-02-16 Thread shyla deshpande
Caused by: org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /checkpoint/11ea8862-122c-4614-bc7e-f761bb57ba23/rdd-347/.part-1-attempt-3 could only be replicated to 0 nodes instead of minReplication (=1). There are 0 datanode(s) running and no node(s) are excluded in this

Re: Debugging Spark application

2017-02-16 Thread Md. Rezaul Karim
Thanks, Sam. I will have a look at it. On Feb 16, 2017 10:06 PM, "Sam Elamin" wrote: > I recommend running Spark in local mode when you're first debugging your > code, just to understand what's happening and step through it, perhaps catch > a few errors when you first

Re: Debugging Spark application

2017-02-16 Thread Sam Elamin
I recommend running Spark in local mode when you're first debugging your code, just to understand what's happening and step through it, and perhaps catch a few errors when you first start off. I personally use IntelliJ because it's my preference. You can follow this guide.
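
A minimal sketch of the local-mode session this refers to:

    import org.apache.spark.sql.SparkSession

    // local[*] runs the driver and executors in a single JVM, so an IDE
    // debugger can set breakpoints in both application and Spark code.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("debug-run")
      .getOrCreate()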

Debugging Spark application

2017-02-16 Thread Md. Rezaul Karim
Hi, I was looking for some URLs/documents for getting started on debugging Spark applications. I prefer developing Spark applications with Scala on Eclipse and then packaging the application jar before submitting. Kind regards, Reza

Re: Pretty print a dataframe...

2017-02-16 Thread Michael Armbrust
The toString method of Dataset.queryExecution includes the various plans. I usually just log that directly. On Thu, Feb 16, 2017 at 8:26 AM, Muthu Jayakumar wrote: > Hello there, > > I am trying to write to log-line a dataframe/dataset queryExecution and/or > its logical
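
A minimal sketch of logging it directly, per the suggestion (logger setup assumed):

    import org.slf4j.LoggerFactory

    val log = LoggerFactory.getLogger("query-plans")
    val df = spark.range(100).filter("id % 2 = 0")

    // queryExecution.toString renders the parsed, analyzed, optimized and physical plans.
    log.info(df.queryExecution.toString)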

Spark on Mesos with Docker in bridge networking mode

2017-02-16 Thread cherryii
I'm getting errors when I try to run my Docker container in bridge networking mode on Mesos. Here is my spark-submit script: /spark/bin/spark-submit \ --class com.package.MySparkJob \ --name My-Spark-Job \ --files /path/config.cfg, ${JAR} \ --master ${SPARK_MASTER_HOST} \ --deploy-mode
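
In bridge mode the driver advertises a container-internal address unless configured otherwise, so executors cannot reach it. A hedged sketch of the settings typically involved (port numbers and HOST_IP are placeholders; spark.driver.bindAddress requires Spark 2.1+):

    --conf spark.driver.port=7001 \
    --conf spark.driver.blockManager.port=7002 \
    --conf spark.driver.bindAddress=0.0.0.0 \
    --conf spark.driver.host=${HOST_IP} \

The same ports then need to be published in the docker run command so the fixed ports are reachable from outside the container.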

Latent Dirichlet Allocation in Spark

2017-02-16 Thread Manish Tripathi
Hi I am trying to do topic modeling in Spark using Spark's LDA package. Using Spark 2.0.2 and pyspark API. I ran the code as below: from pyspark.ml.clustering import LDA  lda = LDA(featuresCol="tf_features", k=10, seed=1, optimizer="online")  ldaModel = lda.fit(tf_df)

Will Spark ever run the same task at the same time

2017-02-16 Thread Ji Yan
Dear spark users, Is there any mechanism in Spark that does not guarantee idempotent task execution? For example, for stragglers, the framework might start another attempt of a task, assuming the straggler is slow, while the straggler is still running. This would be annoying sometimes when, say, the task is writing to
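
What is described here is speculative execution; a minimal sketch of turning it off when tasks have non-idempotent side effects (it is off by default):

    // With spark.speculation enabled, Spark may launch a second attempt of a
    // slow (straggler) task, so external writes must tolerate duplicates.
    val spark = org.apache.spark.sql.SparkSession.builder()
      .appName("no-speculation")
      .config("spark.speculation", "false")   // default is false
      .getOrCreate()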

Re: skewed data in join

2017-02-16 Thread Gourav Sengupta
Hi, Thanks for your kind response. The hash key built from random numbers increases the time for processing the data. My entire join for the entire month finishes within 150 seconds for 471 million records, and then hangs for another 6 minutes on 55 million records. Using hash keys increases the

scala.io.Source.fromFile protocol for hadoop

2017-02-16 Thread nancy henry
Hello, hiveSqlContext.sql(scala.io.Source.fromFile(args(0).toString()).mkString).collect() I have a file on my local system, and I am running spark-submit in cluster deploy mode on Hadoop, so should args(0) be on the Hadoop cluster or local? The protocol for local files is file:///; what should it be for Hadoop? what is the
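
In cluster deploy mode the driver runs on a cluster node, so a bare local path will usually not resolve. A hedged sketch using --files plus SparkFiles (file name is a placeholder; on YARN the shipped file also lands in the container's working directory):

    // Ship the script with: spark-submit --deploy-mode cluster --files queries.sql ...
    import org.apache.spark.SparkFiles
    import scala.io.Source

    val local = SparkFiles.get("queries.sql")   // path to the shipped copy
    hiveSqlContext.sql(Source.fromFile(local).mkString).collect()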

Re: skewed data in join

2017-02-16 Thread Anis Nasir
You can also do something similar to what is mentioned in [1]. The basic idea is to use two hash functions for each key and assign it to the least loaded of the two hashed workers. Cheers, Anis [1].
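
A minimal sketch of the two-choices idea described above (hash functions and load tracking are illustrative only):

    // "Power of two choices": hash each key twice and route it to the
    // less-loaded of the two candidate workers.
    def pickWorker(key: String, load: Array[Long]): Int = {
      val n  = load.length
      val h1 = ((key.hashCode % n) + n) % n           // first candidate
      val h2 = (((key + "#").hashCode % n) + n) % n   // second candidate
      val w  = if (load(h1) <= load(h2)) h1 else h2
      load(w) += 1                                    // record assigned load
      w
    }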

Re: skewed data in join

2017-02-16 Thread Yong Zhang
Yes. You have to change your key or, in big-data terms, "add salt". Yong From: Gourav Sengupta Sent: Thursday, February 16, 2017 11:11 AM To: user Subject: skewed data in join Hi, Is there a way to do multiple reducers for joining
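
A minimal sketch of the salting pattern (DataFrame and column names are placeholders):

    import org.apache.spark.sql.functions._

    val salts = 10
    // Salt the skewed side randomly and replicate the other side once per salt
    // value, so a single hot key spreads across `salts` partitions.
    val bigSalted   = big.withColumn("salt", (rand() * salts).cast("int"))
    val smallSalted = small.withColumn("salt", explode(array((0 until salts).map(lit): _*)))
    val joined = bigSalted.join(smallSalted, Seq("key", "salt")).drop("salt")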

Pretty print a dataframe...

2017-02-16 Thread Muthu Jayakumar
Hello there, I am trying to write to log-line a dataframe/dataset queryExecution and/or its logical plan. The current code... def explain(extended: Boolean): Unit = { val explain = ExplainCommand(queryExecution.logical, extended = extended)

skewed data in join

2017-02-16 Thread Gourav Sengupta
Hi, Is there a way to do multiple reducers for joining on skewed data? Regards, Gourav

Pyspark: out of memory exception during model training

2017-02-16 Thread mzaharchenko
My problem is quite simple: the JVM is running out of memory during model = dt.fit(train_small). My train_small dataset contains only 100 rows (I have limited the number of rows to make sure the size of the dataset doesn't cause the memory overflow). But each row has a column all_features with a long
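
If the assembled feature vectors really are that large, a common first step is simply giving the driver and executors more heap at submit time; a sketch (sizes and file name are placeholders):

    spark-submit \
      --driver-memory 8g \
      --executor-memory 8g \
      train_model.py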

Re: Enrichment with static tables

2017-02-16 Thread Gaurav Agarwal
Thanks, that worked for me. Previously I was using the wrong join; that's the reason it did not work for me. Thanks. On Feb 16, 2017 01:20, "Sam Elamin" wrote: > You can do a join or a union to combine all the dataframes into one fat > dataframe > > or do a select on the columns
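
A minimal sketch of the enrichment join being described (table and column names are placeholders):

    import org.apache.spark.sql.functions.broadcast

    // Broadcasting the small static table lets the join run without a shuffle.
    val enriched = events.join(broadcast(staticLookup), Seq("customer_id"), "left")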