RE: Unable to Access files in Hadoop HA enabled from using Spark

2016-04-12 Thread Amit Hora
Finally I tried setting the configuration manually using sc.hadoopConfiguration.set for dfs.nameservices, dfs.ha.namenodes.hdpha and dfs.namenode.rpc-address.hdpha.n1, and it worked. Don't know why it was not reading these settings from the file under HADOOP_CONF_DIR. -Original Message- Fr
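For reference, a minimal sketch of the workaround described above, in Scala. The nameservice "hdpha" and the namenode id "n1" come from the thread; the second namenode id and the host:port values are placeholders, not settings from the poster's cluster:

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("HA test").setMaster("local[2]"))

// Mirror the HA entries that would normally be read from hdfs-site.xml
sc.hadoopConfiguration.set("dfs.nameservices", "hdpha")
sc.hadoopConfiguration.set("dfs.ha.namenodes.hdpha", "n1,n2")
sc.hadoopConfiguration.set("dfs.namenode.rpc-address.hdpha.n1", "namenode1-host:8020")
sc.hadoopConfiguration.set("dfs.namenode.rpc-address.hdpha.n2", "namenode2-host:8020")
sc.hadoopConfiguration.set(
  "dfs.client.failover.proxy.provider.hdpha",
  "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider")

// The logical nameservice can now be used instead of a single namenode host
println(sc.textFile("hdfs://hdpha/mini_newsgroups/").count())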

RE: Unable to Access files in Hadoop HA enabled from using Spark

2016-04-12 Thread Amit Hora
There are DNS entries for both of my namenodes. Ambarimaster is standby and it resolves to its IP perfectly; Hdp231 is active and it also resolves to its IP. Hdpha is my Hadoop HA cluster name, and hdfs-site.xml has entries for these configurations. -Original Message- From: "Jörn Franke" Sent: ‎4

Re: Unable to Access files in Hadoop HA enabled from using Spark

2016-04-12 Thread Jörn Franke
Is the host in /etc/hosts? > On 13 Apr 2016, at 07:28, Amit Singh Hora wrote: > > I am trying to access a directory in Hadoop from my Spark code on my local > machine. Hadoop is HA enabled. > > val conf = new SparkConf().setAppName("LDA Sample").setMaster("local[2]") > val sc=new SparkContext(conf)

RE: Unable to Access files in Hadoop HA enabled from using Spark

2016-04-12 Thread Amit Singh Hora
This property already exists. -Original Message- From: "ashesh_28 [via Apache Spark User List]" Sent: ‎4/‎13/‎2016 11:02 AM To: "Amit Singh Hora" Subject: Re: Unable to Access files in Hadoop HA enabled from using Spark Try adding the following property into hdfs-site.xml dfs

Re: Unable to Access files in Hadoop HA enabled from using Spark

2016-04-12 Thread ashesh_28
Try adding the following property into hdfs-site.xml: dfs.client.failover.proxy.provider. org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider

Unable to Access files in Hadoop HA enabled from using Spark

2016-04-12 Thread Amit Singh Hora
I am trying to access a directory in Hadoop from my Spark code on my local machine. Hadoop is HA enabled. val conf = new SparkConf().setAppName("LDA Sample").setMaster("local[2]") val sc=new SparkContext(conf) val distFile = sc.textFile("hdfs://hdpha/mini_newsgroups/") println(distFile.count()) but ge

Re: [ASK]:Dataframe number of column limit in Spark 1.5.2

2016-04-12 Thread mdkhajaasmath
I am also looking for the same information. In my case I need to create 190 columns. Sent from my iPhone > On Apr 12, 2016, at 9:49 PM, Divya Gehlot wrote: > > Hi, > I would like to know does Spark Dataframe API has limit on creation of > number of columns? > > Thanks, > Divya

[ASK]:Dataframe number of column limit in Spark 1.5.2

2016-04-12 Thread Divya Gehlot
Hi, I would like to know whether the Spark DataFrame API has a limit on the number of columns that can be created. Thanks, Divya

Re: Can i have a hive context and sql context in the same app ?

2016-04-12 Thread Michael Armbrust
You don't need multiple contexts to do this: http://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc-to-other-databases On Tue, Apr 12, 2016 at 4:05 PM, Michael Segel wrote: > Reading from multiple sources within the same application? > > How would you connect to Hive for some data a
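A minimal sketch of the JDBC data source referenced above, using the Spark 1.6 DataFrame reader; the connection URL, table name and credentials are hypothetical placeholders, and the matching JDBC driver jar has to be on the classpath:

val jdbcDF = sqlContext.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://dbhost:5432/mydb")
  .option("dbtable", "public.orders")
  .option("user", "spark")
  .option("password", "secret")
  .load()

// The same HiveContext can also read Hive tables, so a single context is enough
jdbcDF.registerTempTable("orders")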

RE: How does spark-submit handle Python scripts (and how to repeat it)?

2016-04-12 Thread Sun, Rui
Spark configurations specified at the command line for spark-submit should be passed to the JVM inside the Julia process. You can refer to https://github.com/apache/spark/blob/master/launcher/src/main/java/org/apache/spark/launcher/SparkSubmitCommandBuilder.java#L267 and https://github.com/apache/s

Re: How to estimate the size of dataframe using pyspark?

2016-04-12 Thread Buntu Dev
Thanks Davies, I've shared the code snippet and the dataset. Please let me know if you need any other information. On Mon, Apr 11, 2016 at 10:44 AM, Davies Liu wrote: > That's weird, DataFrame.count() should not require lots of memory on > driver, could you provide a way to reproduce it (could g

RE: ML Random Forest Classifier

2016-04-12 Thread Ashic Mahtab
It looks like the issue is around impurity stats. After converting an rf model to old, and back to new (without disk storage or anything), and specifying the same num of features, same categorical features map, etc., DecisionTreeClassifier::predictRaw throws a null pointer exception here: overr

RE: ML Random Forest Classifier

2016-04-12 Thread Ashic Mahtab
I managed to get to the map using MetadataUtils (it's a private ml package). There are still some issues, around feature names, etc. Trying to pin them down. From: as...@live.com To: ja...@gluru.co CC: user@spark.apache.org Subject: RE: ML Random Forest Classifier Date: Wed, 13 Apr 2016 00:50:31

Re: Spark 1.6.1 packages on S3 corrupt?

2016-04-12 Thread Nicholas Chammas
Yes, this is a known issue. The core devs are already aware of it. [CC dev] FWIW, I believe the Spark 1.6.1 / Hadoop 2.6 package on S3 is not corrupt. It may be the only 1.6.1 package that is not corrupt, though. :/ Nick On Tue, Apr 12, 2016 at 9:00 PM Augustus Hong wrote: > Hi all, > > I'm t

Spark 1.6.1 packages on S3 corrupt?

2016-04-12 Thread Augustus Hong
Hi all, I'm trying to launch a cluster with the spark-ec2 script but seeing the error below. Are the packages on S3 corrupted / not in the correct format? Initializing spark --2016-04-13 00:25:39-- http://s3.amazonaws.com/spark-related-packages/spark-1.6.1-bin-hadoop1.tgz Resolving s3.amazonaw

RE: ML Random Forest Classifier

2016-04-12 Thread Ashic Mahtab
Hi James, Following on from the previous email, is there a way to get the categoricalFeatures of a Spark ML Random Forest? Essentially something I can pass to RandomForestClassificationModel.fromOld(oldModel, parent, categoricalFeatures, numClasses, numFeatures) I could construct it by hand, but

Spark accessing secured HDFS

2016-04-12 Thread vijikarthi
Hello, I am trying to understand Spark's support for accessing a secure HDFS cluster. My plan is to deploy "Spark on Mesos", which will access a secure HDFS cluster running elsewhere in the network. I am trying to understand how much support exists as of now. My understanding is Spark as of now sup

how to write pyspark interface to scala code?

2016-04-12 Thread AlexG
I have Scala Spark code for computing a matrix factorization. I'd like to make it possible to use this code from PySpark, so users can pass in a python RDD and receive back one without knowing or caring that Scala code is being called. Please point me to an example of code (e.g. somewhere in the S

Spark + Secure HDFS Cluster

2016-04-12 Thread Vijay Srinivasaraghavan
Hello, I am trying to understand Spark's support for accessing a secure HDFS cluster. My plan is to deploy Spark on Mesos, which will access a secure HDFS cluster running elsewhere in the network. I am trying to understand how much support exists as of now. My understanding is Spark as of now suppor

Fwd: Can i have a hive context and sql context in the same app ?

2016-04-12 Thread Michael Segel
Sorry for duplicate(s), I forgot to switch my email address. > Begin forwarded message: > > From: Michael Segel > Subject: Re: Can i have a hive context and sql context in the same app ? > Date: April 12, 2016 at 4:05:26 PM MST > To: Michael Armbrust > Cc: Natu Lauchande , "user@spark.apache.

Silly question...

2016-04-12 Thread Michael Segel
Hi, This is probably a silly question on my part… I'm looking at the latest (Spark 1.6.1 release) and would like to do a build with Hive and JDBC support. From the documentation, I see two things that make me scratch my head. 1) Scala 2.11 "Spark does not yet support its JDBC component for Sca

Re: Old hostname pops up while running Spark app

2016-04-12 Thread Bibudh Lahiri
Hi Ted, Thanks for your prompt reply. I am afraid clearing the DNS cache did not help. I did the following: sudo /etc/init.d/dnsmasq restart on the two nodes I am using, as I did not have nscd, but I am still getting the same error. I am launching the master from 172.26.49.156, whose old name

RE: DStream how many RDD's are created by batch

2016-04-12 Thread David Newberger
Hi Natu, I believe you are correct: one RDD would be created for each file. Cheers, David From: Natu Lauchande [mailto:nlaucha...@gmail.com] Sent: Tuesday, April 12, 2016 1:48 PM To: David Newberger Cc: user@spark.apache.org Subject: Re: DStream how many RDD's are created by batch Hi David, Tha

Re: Old hostname pops up while running Spark app

2016-04-12 Thread Ted Yu
FYI https://documentation.cpanel.net/display/CKB/How+To+Clear+Your+DNS+Cache#HowToClearYourDNSCache-MacOS ®10.10 https://www.whatsmydns.net/flush-dns.html#linux On Tue, Apr 12, 2016 at 2:44 PM, Bibudh Lahiri wrote: > Hi, > > I am trying to run a piece of code with logistic regression on > P

Old hostname pops up while running Spark app

2016-04-12 Thread Bibudh Lahiri
Hi, I am trying to run a piece of code with logistic regression on PySpark. I’ve run it successfully on my laptop, and I have run it previously on a standalone cluster mode, but the name of the server on which I am running it was changed in between (the old name was "IMPETUS-1466") by the admi

S3n performance (@AaronDavidson)

2016-04-12 Thread Martin Eden
Hi everyone, Running on EMR 4.3 with Spark 1.6.0 and the provided S3N native driver, I manage to process approx 1TB of strings inside gzipped parquet in about 50 mins on a 20-node cluster (8 cores, 60GB RAM). That's about 17 MBytes/sec per node. This seems suboptimal. The processing is very basic

Re: JavaRDD with custom class?

2016-04-12 Thread Ted Yu
You can find various examples involving Serializable Java POJO e.g. ./examples/src/main/java/org/apache/spark/examples/ml/JavaALSExample.java Please pastebin some details on 'Task not serializable error' Thanks On Tue, Apr 12, 2016 at 12:44 PM, Daniel Valdivia wrote: > Hi, > > I'm moving some

Re: How does spark-submit handle Python scripts (and how to repeat it)?

2016-04-12 Thread Andrei
> > One part is passing the command line options, like “--master”, from the > JVM launched by spark-submit to the JVM where SparkContext resides Since I have full control over both - JVM and Julia parts - I can pass whatever options to both. But what exactly should be passed? Currently pipeline l

Re: An error occurred while calling z:org.apache.spark.sql.execution.EvaluatePython.takeAndServe

2016-04-12 Thread Ted Yu
bq. Most recent failure cause: Can you paste the remaining cause ? Which Spark release are you using ? Thanks On Tue, Apr 12, 2016 at 1:10 PM, AlexModestov wrote: > I get an error while I form a dataframe from the parquet file: > > Py4JJavaError: An error occurred while calling > z:org.apache

An error occurred while calling z:org.apache.spark.sql.execution.EvaluatePython.takeAndServe

2016-04-12 Thread AlexModestov
I get an error while I form a dataframe from the parquet file: Py4JJavaError: An error occurred while calling z:org.apache.spark.sql.execution.EvaluatePython.takeAndServe. : org.apache.spark.SparkException: Job aborted due to stage failure: Exception while getting task result: org.apache.spark.sto

Re: build spark 1.6 against cdh5.7 with hadoop 2.6.0 hbase 1.2: Failure

2016-04-12 Thread freedafeng
Agh, a typo: I was supposed to use cdh5.7.0. I reran the command with the fix, but still get the same error. build/mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0-cdh5.7.0 -DskipTests clean package

build spark 1.6 against cdh5.7 with hadoop 2.6.0 hbase 1.2: Failure

2016-04-12 Thread freedafeng
jdk: 1.8.0_77 scala: 2.10.4 mvn: 3.3.9. Slightly changed the pom.xml: $ diff pom.xml pom.original 130c130 < 2.6.0-cdh5.7.0-SNAPSHOT --- > 2.2.0 133c133 < 1.2.0-cdh5.7.0-SNAPSHOT --- > 0.98.7-hadoop2 command: build/mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0-cdh5.6.0 -DskipTes

JavaRDD with custom class?

2016-04-12 Thread Daniel Valdivia
Hi, I'm moving some code from Scala to Java and I just hit a wall where I'm trying to move an RDD with a custom data structure to Java, but I'm not able to do so: Scala Code: case class IncodentDoc(system_id: String, category: String, terms: Seq[String]) var incTup = inc_filtered.map(rec

Re: Aggregator support in DataFrame

2016-04-12 Thread Koert Kuipers
Still not sure how to use this with a DataFrame, assuming I cannot convert it to a specific Dataset with .as (because I have lots of columns, or because at compile time these types are simply not known). I cannot specify the columns these operate on. I can resort to Row transformations, like this:

Re: DStream how many RDD's are created by batch

2016-04-12 Thread Natu Lauchande
Hi David, Thanks for your answer. I have a follow-up question: I am using textFileStream, listening on an S3 bucket for new files to process. Files are created every 5 minutes and my batch interval is 2 minutes. Does it mean that each file will correspond to one RDD? Thanks, Natu On Tue, Apr

Re: [ML] Training with bias

2016-04-12 Thread Daniel Siegmann
Yes, that's what I was looking for. Thanks. On Tue, Apr 12, 2016 at 9:28 AM, Nick Pentreath wrote: > Are you referring to fitting the intercept term? You can use > lr.setFitIntercept (though it is true by default): > > scala> lr.explainParam(lr.fitIntercept) > res27: String = fitIntercept: wheth

Re: Aggregator support in DataFrame

2016-04-12 Thread Michael Armbrust
Did you see these? https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/expressions/scala/typed.scala#L70 On Tue, Apr 12, 2016 at 9:46 AM, Koert Kuipers wrote: > i dont really see how Aggregator can be useful for DataFrame unless you > can specify what column

Re: ordering over structs

2016-04-12 Thread Michael Armbrust
Does the data actually fit in memory? Check the web UI. If it doesn't, caching is not going to help you. On Tue, Apr 12, 2016 at 9:00 AM, Imran Akbar wrote: > thanks Michael, > > That worked. > But what's puzzling is if I take the exact same code and run it off a temp > table created from parqu

RE: DStream how many RDD's are created by batch

2016-04-12 Thread David Newberger
Hi, Time is usually the criterion, if I'm understanding your question. An RDD is created for each batch interval. If your interval is 500ms then an RDD would be created every 500ms. If it's 2 seconds then an RDD is created every 2 seconds. Cheers, David From: Natu Lauchande [mailto:nlaucha...@g
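A small sketch that makes the batching visible, using the 2-minute interval from Natu's scenario; the S3 path is a placeholder:

import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(120))            // 2-minute batches
val lines = ssc.textFileStream("s3n://my-bucket/incoming/") // placeholder bucket

// One RDD is handed to this function per batch interval, holding whatever
// new files were picked up during that interval (possibly none).
lines.foreachRDD { (rdd, time) =>
  println(s"batch $time: ${rdd.partitions.length} partitions, ${rdd.count()} records")
}

ssc.start()
ssc.awaitTermination()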

Re: Can i have a hive context and sql context in the same app ?

2016-04-12 Thread Michael Armbrust
You can, but I'm not sure why you would want to. If you want to isolate different users just use hiveContext.newSession(). On Tue, Apr 12, 2016 at 1:48 AM, Natu Lauchande wrote: > Hi, > > Is it possible to have both a sqlContext and a hiveContext in the same > application ? > > If yes would the
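A minimal sketch of the newSession() suggestion, assuming Spark 1.6: sessions share the SparkContext and cached data but keep temporary tables, UDFs and SQL conf isolated. The JSON path is a placeholder:

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
val sessionA = hiveContext.newSession()
val sessionB = hiveContext.newSession()

// Registered in sessionA only; not visible from sessionB or the parent context
sessionA.read.json("/tmp/people.json").registerTempTable("people")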

Creating a New Cassandra Table From a DataFrame Schema

2016-04-12 Thread Prateek .
Hi, I am trying to create a new Cassandra table by inferring the schema from JSON: https://github.com/datastax/spark-cassandra-connector/blob/master/doc/14_data_frames.md I am not able to get the createCassandraTable function on a DataFrame: import com.datastax.spark.connector._ df.createCassandraTable(
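For comparison, a sketch of the call from the linked documentation, assuming a connector version whose implicits add the DataFrame functions; the keyspace, table and input path are placeholders:

import com.datastax.spark.connector._   // provides createCassandraTable on DataFrame

val df = sqlContext.read.json("/tmp/events.json")

// Derives a Cassandra schema from the DataFrame schema and creates the table
df.createCassandraTable("my_keyspace", "events")

If the method does not resolve, the connector version on the classpath may simply predate the DataFrame functions.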

Re: Monitoring S3 Bucket with Spark Streaming

2016-04-12 Thread Benjamin Kim
All, I have more of a general Scala JSON question. I have set up a notification on the S3 source bucket that triggers a Lambda function to unzip the new file placed there. Then, it saves the unzipped CSV file into another destination bucket where a notification is sent to an SQS topic. The conte

Re: Aggregator support in DataFrame

2016-04-12 Thread Koert Kuipers
I don't really see how Aggregator can be useful for DataFrame unless you can specify what columns it works on. Having to code Aggregators to always use Row and then extract the values yourself breaks the abstraction and makes it not much better than UserDefinedAggregateFunction (well... maybe still

How do i get a spark instance to use my log4j properties

2016-04-12 Thread Steve Lewis
OK, I am stymied. I have tried everything I can think of to get Spark to use my own version of log4j.properties. In the launcher code (I launch a local instance from a Java application) I say -Dlog4j.configuration=conf/log4j.properties, where conf/log4j.properties is under user.dir - no luck. Spark alwa
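One thing worth checking (an assumption, not something confirmed in the thread): log4j.configuration is interpreted as a URL, so a bare relative path is often ignored. Setting it as an absolute file: URL before the SparkContext (and therefore log4j) is initialized usually works for local mode:

import org.apache.spark.{SparkConf, SparkContext}

// Point log4j at an absolute file: URL before anything triggers log4j initialization
System.setProperty(
  "log4j.configuration",
  "file:" + new java.io.File("conf/log4j.properties").getAbsolutePath)

val sc = new SparkContext(new SparkConf().setAppName("log4j-test").setMaster("local[2]"))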

Re: Run a self-contained Spark app on a Spark standalone cluster

2016-04-12 Thread kevllino
Update: - I managed to log in to the cluster. - I want to use copy-dir to deploy my Python files to all nodes; I read I need to copy them to /ephemeral/hdfs, but I don't know how to move them from my local machine to HDFS on the cluster. Thanks in advance, Kevin

Re: History Server Refresh?

2016-04-12 Thread Miles Crawford
It is completed apps that are not showing up. I'm fine with incomplete apps not appearing. On Tue, Apr 12, 2016 at 6:43 AM, Steve Loughran wrote: > > On 12 Apr 2016, at 00:21, Miles Crawford wrote: > > Hey there. I have my spark applications set up to write their event logs > into S3 - this is

Re: [spark] build/sbt gen-idea error

2016-04-12 Thread Sean Owen
We just removed the gen-idea plugin. Just import the Maven project into IDEA or Eclipse. On Tue, Apr 12, 2016 at 4:52 PM, ImMr.K <875061...@qq.com> wrote: > But how to import spark repo into idea or eclipse? > > > > -- Original Message -- > From: Ted Yu > Sent: 2016-04-12 23:38

Re: ordering over structs

2016-04-12 Thread Imran Akbar
Thanks Michael, That worked. But what's puzzling is that if I take the exact same code and run it off a temp table created from parquet vs. a cached table, the cached one runs much slower: 5-10 seconds uncached vs. 47-60 seconds cached. Any ideas why? Here's my code snippet: df = data.select("customer_id", st

Re: [spark] build/sbt gen-idea error

2016-04-12 Thread ImMr.K
But how to import the spark repo into IDEA or Eclipse? -- Original Message -- From: Ted Yu Sent: 2016-04-12 23:38 To: ImMr.K <875061...@qq.com> Cc: user Subject: Re: build/sbt gen-idea error gen-idea doesn't seem to be a valid command: [warn] Ignoring load f

Re: Re:[spark] build/sbt gen-idea error

2016-04-12 Thread Marco Mistroni
Have you tried the SBT Eclipse plugin? Then you can run sbt eclipse and have your Spark project directly in Eclipse. Please Google it and you should be able to find your way. If not, ping me and I'll send you the plugin (I'm replying from my phone). HTH On 12 Apr 2016 4:53 pm, "ImMr.K" <875061...@qq.com> wrote: But how to

Re: [spark] build/sbt gen-idea error

2016-04-12 Thread Ted Yu
See https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools#UsefulDeveloperTools-IDESetup On Tue, Apr 12, 2016 at 8:52 AM, ImMr.K <875061...@qq.com> wrote: > But how to import spark repo into idea or eclipse? > > > > -- Original Message -- > *From:* Ted Yu > *

import data from s3 which region supports AWS4-HMAC-SHA256 only

2016-04-12 Thread QiuxuanZhu
Dear all, Is it possible for now that Spark could import data from S3 regions which support AWS4-HMAC-SHA256 only, such as Frankfurt or Beijing? I checked the issues on GitHub; it seems that upgrading jets3t failed in the unit tests for now. https://github.com/apache/spark/pull/9306 So, any help? Thanks. -- 跑不完马拉松的

Re: build/sbt gen-idea error

2016-04-12 Thread Ted Yu
gen-idea doesn't seem to be a valid command: [warn] Ignoring load failure: no project loaded. [error] Not a valid command: gen-idea [error] gen-idea On Tue, Apr 12, 2016 at 8:28 AM, ImMr.K <875061...@qq.com> wrote: > Hi, > I have cloned spark and , > cd spark > build/sbt gen-idea > > got the fol

build/sbt gen-idea error

2016-04-12 Thread ImMr.K
Hi, I have cloned spark and ran: cd spark build/sbt gen-idea and got the following output: Using /usr/java/jre1.7.0_09 as default JAVA_HOME. Note, this will be overridden by -java-home if it is set. [info] Loading project definition from /home/king/github/spark/project/project [info] Loading project

HiveContext in spark

2016-04-12 Thread Selvam Raman
I could not use the insert, update and delete commands in HiveContext. I am using Spark 1.6.1 and Hive 1.1.0. Please find the error below. scala> hc.sql("delete from trans_detail where counter=1"); 16/04/12 14:58:45 INFO ParseDriver: Parsing command: delete from trans_detail wher

Starting Spark Job Remotely with Function Call

2016-04-12 Thread Prateek .
Hi, To start a Spark job remotely we are using: 1) spark-submit 2) Spark Job Server. For quick testing of my application in the IDE, I am calling a function that creates the SparkContext, sets all the Spark configuration and the master, and executes the flow, and I can see the output in

Importing hive thrift server

2016-04-12 Thread ram kumar
Hi, In spark-shell, we start the Hive thrift server by importing: import org.apache.spark.sql.hive.thriftserver._ Is there a package for importing it from pyspark? Thanks

Performance of algorithm on top of Pregel non linearly depends on the number of iterations

2016-04-12 Thread lamerman
I am trying to run a greedy graph partitioning algorithm on top of Pregel. My algorithm is pretty simple: val InitialMsg = val NoGroup = -1 def mergeMsg(msg1: Int, msg2: Int): Int = { msg1 min msg2 } def vprog(vertexId: VertexId, value: Long, message: Int): Long = { if (message == InitialM

Choosing an Algorithm in Spark MLib

2016-04-12 Thread Joe San
I'm working on a problem wherein I have some data sets about some power generating units. Each of these units has been activated to run in the past and

Re: how to deploy new code with checkpointing

2016-04-12 Thread Cody Koeninger
- Checkpointing alone isn't enough to get exactly-once semantics. Events will be replayed in case of failure. You must have idempotent output operations. - Another way to handle upgrades is to just start a second app with the new code, then stop the old one once everything's caught up. On Tue, A

Re: Can i have a hive context and sql context in the same app ?

2016-04-12 Thread Gourav Sengupta
Allowing multiple Spark contexts is different from having a hiveContext and a sqlContext. hiveContext has all the methods and properties of sqlContext, and more. In the GUI you will be able to see SQL, SQL1, ... tabs when multiple sqlContexts are active. But why would you want that?

Re: History Server Refresh?

2016-04-12 Thread Steve Loughran
On 12 Apr 2016, at 00:21, Miles Crawford <mil...@allenai.org> wrote: Hey there. I have my spark applications set up to write their event logs into S3 - this is super useful for ephemeral clusters, I can have persistent history even though my hosts go away. A history server is set up to

Re: [ML] Training with bias

2016-04-12 Thread Nick Pentreath
Are you referring to fitting the intercept term? You can use lr.setFitIntercept (though it is true by default): scala> lr.explainParam(lr.fitIntercept) res27: String = fitIntercept: whether to fit an intercept term (default: true) On Mon, 11 Apr 2016 at 21:59 Daniel Siegmann wrote: > I'm trying
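A minimal sketch of the above (fitIntercept is true by default, so this is only needed to make the choice explicit):

import org.apache.spark.ml.classification.LogisticRegression

val lr = new LogisticRegression()
  .setFitIntercept(true)   // fit the intercept / bias term

println(lr.explainParam(lr.fitIntercept))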

Re: HashingTF "compatibility" across Python, Scala?

2016-04-12 Thread Nick Pentreath
I should point out that actually the "ml" version of HashingTF does call into Java, so that will be consistent across Python and Java. It's the "mllib" version in PySpark that implements its own version using Python's "hash" function (while Java uses Object.hashCode). On Thu, 7 Apr 2016 at 18:19 Ni

Re: Run a self-contained Spark app on a Spark standalone cluster

2016-04-12 Thread Kevin Eid
Hi, Thanks for your emails. I tried running your command but it returned: "No such file or directory". So I definitely need to move my local .py files to the cluster. I tried logging in (before sshing) but could not find the master: ./spark-ec2 -k key -i key.pem login weather-cluster - then sshing

DStream how many RDD's are created by batch

2016-04-12 Thread Natu Lauchande
Hi, What's the criterion for the number of RDDs created for each micro-batch iteration? Thanks, Natu

RE: Can i have a hive context and sql context in the same app ?

2016-04-12 Thread Sun, Rui
val ALLOW_MULTIPLE_CONTEXTS = booleanConf("spark.sql.allowMultipleContexts", defaultValue = Some(true), doc = "When set to true, creating multiple SQLContexts/HiveContexts is allowed." + "When set to false, only one SQLContext/HiveContext is allowed to be created " + "throug
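A sketch of setting the flag quoted above explicitly (it already defaults to true); the names are illustrative:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val conf = new SparkConf()
  .setAppName("multi-context-demo")
  .set("spark.sql.allowMultipleContexts", "true")
val sc = new SparkContext(conf)

val sqlContext1 = new SQLContext(sc)
val sqlContext2 = new SQLContext(sc)   // allowed while the flag is true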

Graphical representation of Spark Decision Tree . How to do it ?

2016-04-12 Thread Eli Super
Hi Spark Users, I need your help. I have some output after running DecisionTree. I work with Jupyter notebook and Python 2.7. How can I create a graphical representation of the Decision Tree model? In sklearn I can use tree.export_graphviz, and in R I can see the Decision Tree output as well.

RE: Run a self-contained Spark app on a Spark standalone cluster

2016-04-12 Thread Sun, Rui
Which py file is your main file (primary py file)? Zip the other two py files. Leave the main py file alone. Don't copy them to S3 because it seems that only local primary and additional py files are supported. ./bin/spark-submit --master spark://... --py-files -Original Message- From

RE: How does spark-submit handle Python scripts (and how to repeat it)?

2016-04-12 Thread Sun, Rui
There is much deployment preparation work handling different deployment modes for pyspark and SparkR in SparkSubmit. It is difficult to summarize it briefly; you had better refer to the source code. Supporting running Julia scripts in SparkSubmit is more than implementing a 'JuliaRunner'. One p

Re: Need Streaming output to single HDFS File

2016-04-12 Thread Sachin Aggarwal
Hey, you can use repartition and set it to 1, as in this example: unionDStream.foreachRDD((rdd, time) => { val count = rdd.count() println("count" + count) if (count > 0) { print("rdd partition=" + rdd.partitions.length) val outputRDD = rdd.repartition(numFilesPerParti
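A compact variant of the snippet above, as a sketch only: one part file per batch, written under a placeholder output path:

unionDStream.foreachRDD { (rdd, time) =>
  if (rdd.count() > 0) {
    // repartition(1) forces a single partition, hence a single part-00000 file
    rdd.repartition(1)
       .saveAsTextFile(s"hdfs:///streaming/output/batch-${time.milliseconds}")
  }
}

Note that each batch still goes into its own directory; writing every batch into literally one file would require appending, which saveAsTextFile does not do.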

Need Streaming output to single HDFS File

2016-04-12 Thread Priya Ch
Hi All, I am working with Kafka and Spark Streaming, and I want to write the streaming output to a single file. dstream.saveAsTextFiles() is creating files in different folders. Is there a way to write to a single folder? Or, if the output is written to different folders, how do I merge them? Thanks, Padma C

Re: Running Spark on Yarn-Client/Cluster mode

2016-04-12 Thread Jon Kjær Amundsen
Hi Ashesh, You might be experiencing problems with the virtual memory allocation. Try grepping the yarn-hadoop-nodemanager-*.log (found in $HADOOP_INSTALL/logs) for 'virtual memory limits'. If you see a message like - WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.Co

Re: Run a self-contained Spark app on a Spark standalone cluster

2016-04-12 Thread Alonso Isidoro Roman
I don't know how to do it with Python, but Scala has an sbt plugin named sbt-pack that creates a self-contained Unix command from your code, with no need to use spark-submit. There should be something similar to this tool out there. Alonso Isidoro Roman. My favorite quotes (of today): "If debugging is the pr

Installation Issues - Spark 1.6.0 With Hadoop 2.6 - Pre Built On Windows 7

2016-04-12 Thread My List
Dear Experts, I need help to get this resolved. What am I doing wrong? Any help is greatly appreciated. Env - Windows 7, 64-bit OS; Spark 1.6.0 with Hadoop 2.6, pre-built setup; JAVA_HOME points to 1.7; SCALA_HOME - 2.11. I have an Admin User and a Standard User on Windows. All the setups and running of

Run a self-contained Spark app on a Spark standalone cluster

2016-04-12 Thread kevllino
Hi, I need to know how to run a self-contained Spark app (3 python files) in a Spark standalone cluster. Can I move the .py files to the cluster, or should I store them locally, on HDFS or S3? I tried the following locally and on S3 with a zip of my .py files as suggested here

Can i have a hive context and sql context in the same app ?

2016-04-12 Thread Natu Lauchande
Hi, Is it possible to have both a sqlContext and a hiveContext in the same application? If yes, would there be any performance penalties in doing so? Regards, Natu

Init/Setup worker thread

2016-04-12 Thread Perrin, Lionel
Hello, I'm looking for a solution to use JRuby on top of Spark. It looks like the work required is quite small. The only tricky point is that I need to make sure that every worker thread has a Ruby interpreter initialized. Basically, I need to register a function to be called when each worker t

Check if spark master/history server is running via Java

2016-04-12 Thread Mihir Monani
Hi, How do I check if the Spark master / history server is running on a node? Is there any command for it? I would like to accomplish this with Java if possible. Thanks, Mihir Monani