Re: Spark 1.6.0 - token renew failure

2016-04-13 Thread Jeff Zhang
Spark does not support specifying both principal and proxy-user; you need to use either proxy-user or principal. Currently Spark only seems to check this among the spark-submit arguments and ignores the configuration in spark-defaults.xml: if (proxyUser != null && principal != null) {
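A simplified sketch of the check being described (names are illustrative; as noted above, the real check in SparkSubmit only sees values parsed from the command line, so a principal coming from spark-defaults.xml slips past it):

def validate(proxyUser: String, principal: String): Unit = {
  // Simplified sketch: reject the unsupported combination of credentials.
  if (proxyUser != null && principal != null) {
    throw new IllegalArgumentException(
      "Only one of --proxy-user or --principal can be provided.")
  }
}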

RE: How does spark-submit handle Python scripts (and how to repeat it)?

2016-04-13 Thread Sun, Rui
In SparkSubmit, there is less work for yarn-client than for yarn-cluster. It basically prepares some Spark configurations as system properties, for example information on additional resources required by the application that need to be distributed to the cluster. These configurations will be

Memory needs when using expensive operations like groupBy

2016-04-13 Thread Divya Gehlot
Hi, I am using Spark 1.5.2 with Scala 2.10, and my Spark jobs run fine except one job where I am using unionAll and a groupBy operation on multiple columns; that job keeps failing with exit code 143. Please advise me on options to optimize it. The one option which I am using now is --conf
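As an illustration of the kind of settings usually tuned when a shuffle-heavy groupBy job has its containers killed (exit code 143 on YARN is often a container terminated for exceeding its memory limit), a hedged sketch with hypothetical values:

import org.apache.spark.SparkConf

// Hypothetical values for illustration only; real sizes depend on the data and cluster.
val conf = new SparkConf()
  .set("spark.executor.memory", "6g")                 // executor heap
  .set("spark.yarn.executor.memoryOverhead", "1024")  // extra off-heap headroom on YARN, in MB
  .set("spark.shuffle.memoryFraction", "0.4")         // more room for the shuffle in Spark 1.5.x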

RE: 回复: build/sbt gen-idea error

2016-04-13 Thread Yu, Yucai
Reminder: gen-idea has been removed on master. See: commit a172e11cba6f917baf5bd6c4f83dc6689932de9a Author: Luciano Resende Date: Mon Apr 4 16:55:59 2016 -0700 [SPARK-14366] Remove sbt-idea plugin ## What changes were proposed in this pull request?

pyspark EOFError after calling map

2016-04-13 Thread Pete Werner
Hi, I am new to Spark & PySpark. I am reading a small csv file (~40k rows) into a dataframe:
from pyspark.sql import functions as F
df = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load('/tmp/sm.csv')
df = df.withColumn('verified',

Error starting httpd after the installation using spark-ec2 script

2016-04-13 Thread Mohed Alibrahim
Dear All, I installed Spark 1.6.1 on Amazon EC2 using the spark-ec2 script. Everything was OK, but it failed to start httpd at the end of the installation. I followed the instructions exactly and repeated the process many times, but no luck. - [timing] rstudio setup: 00h

Re: How does spark-submit handle Python scripts (and how to repeat it)?

2016-04-13 Thread Andrei
> Julia can pick up the env var, and set the system properties or directly fill the configurations into a SparkConf, and then create a SparkContext
That's the point - just setting the master to "yarn-client" doesn't work, even in Java/Scala. E.g. the following code in Scala: val conf = new
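An assumed sketch of the kind of code under discussion (the original snippet is truncated), creating a yarn-client context directly from code:

import org.apache.spark.{SparkConf, SparkContext}

// Setting the master alone is not enough: the YARN/Hadoop configuration and the
// spark.yarn.* properties that spark-submit normally injects also have to be present.
val conf = new SparkConf()
  .setAppName("YarnClientFromCode")
  .setMaster("yarn-client")
val sc = new SparkContext(conf)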

Re: Strange bug: Filter problem with parenthesis

2016-04-13 Thread Michael Armbrust
You need to use `backticks` to reference columns that have non-standard characters. On Wed, Apr 13, 2016 at 6:56 AM, wrote: > Hi, > > I am debugging a program, and for some reason, a line calling the > following is failing: > > df.filter("sum(OpenAccounts) >
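A minimal sketch of the suggested fix (the grouping column "Client" is hypothetical; df is the dataframe from the original post):

import org.apache.spark.sql.functions.sum

// The aggregated column is literally named "sum(OpenAccounts)", so the filter
// string needs backticks around that non-standard column name.
val aggregated = df.groupBy("Client").agg(sum("OpenAccounts"))
aggregated.filter("`sum(OpenAccounts)` > 5").show()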

error "Py4JJavaError: An error occurred while calling z:org.apache.spark.sql.execution.EvaluatePython.takeAndServe."

2016-04-13 Thread AlexModestov
I get this error. Does anyone know what it means? Py4JJavaError: An error occurred while calling z:org.apache.spark.sql.execution.EvaluatePython.takeAndServe. : org.apache.spark.SparkException: Job aborted due to stage failure: Exception while getting task result:

RE: how does sc.textFile translate regex in the input.

2016-04-13 Thread Yong Zhang
It is described in "Hadoop: The Definitive Guide", chapter 3, FilePatterns: https://www.safaribooksonline.com/library/view/hadoop-the-definitive/9781449328917/ch03.html#FilePatterns Yong From: pradeep1...@gmail.com Date: Wed, 13 Apr 2016 18:56:58 + Subject: how does sc.textFile translate regex in

Re: Py4JJavaError: An error occurred while calling o115.parquet. _metadata is not a Parquet file (too small)

2016-04-13 Thread Mich Talebzadeh
Actually, how many tables are involved here? What version of Hive is used? Sorry, I have no idea about the Cloudera 5.5.1 spec. HTH Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

how does sc.textFile translate regex in the input.

2016-04-13 Thread Pradeep Nayak
I am trying to understand how Spark's sc.textFile() works, specifically how it translates paths with wildcard patterns in them. For example: files = sc.textFile("hdfs://:/file1/*/*/*/*.txt") How does it find all the sub-directories and recurse to all the leaf files? Is there
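As the reply above notes, the expansion is done by Hadoop's file pattern (glob) handling rather than by Spark itself; a small sketch with hypothetical paths, assuming an existing SparkContext sc:

// Each "*" is a Hadoop glob wildcard matching one path segment, so every
// directory level that should be traversed needs its own wildcard.
val oneLevel   = sc.textFile("hdfs://namenode:8020/logs/*/*.txt")      // logs/<day>/<file>.txt
val threeLevel = sc.textFile("hdfs://namenode:8020/logs/*/*/*/*.txt")  // logs/<y>/<m>/<d>/<file>.txt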

Py4JJavaError: An error occurred while calling o115.parquet. _metadata is not a Parquet file (too small)

2016-04-13 Thread pseudo oduesp
Hi guys, I have this error after 5 hours of processing. I make a lot of joins, 14 left joins with a small table. In the Spark UI and console log everything looks OK, but when it saves the last join I get this error: Py4JJavaError: An error occurred while calling o115.parquet. _metadata is not a Parquet

Error starting Spark 1.6.1

2016-04-13 Thread Mohed Alibrahim
Dear All, I installed Spark 1.6.1 on Amazon EC2 using the spark-ec2 script. Everything was OK, but it failed to start httpd at the end of the installation. I followed the instructions exactly and repeated the process many times, but no luck. - [timing] rstudio setup: 00h

Re: Logging in executors

2016-04-13 Thread Carlos Rojas Matas
Hi Yong, thanks for your response. As I said in my first email, I've tried both the reference to the classpath resource (env/dev/log4j-executor.properties) and the file:// protocol. Also, the driver logging is working fine and I'm using the same kind of reference. Below is the content of my

Re: Silly question...

2016-04-13 Thread Mich Talebzadeh
These are the components:
java -version
java version "1.8.0_77"
Java(TM) SE Runtime Environment (build 1.8.0_77-b03)
Java HotSpot(TM) 64-Bit Server VM (build 25.77-b03, mixed mode)
hadoop version
Hadoop 2.6.0
Subversion https://git-wip-us.apache.org/repos/asf/hadoop.git -r

RE: Logging in executors

2016-04-13 Thread Yong Zhang
Is the env/dev/log4j-executor.properties file within your jar file? Does the path match what you specified as env/dev/log4j-executor.properties? If you read the log4j document here: https://logging.apache.org/log4j/1.2/manual.html When you specify the

Re: Logging in executors

2016-04-13 Thread Carlos Rojas Matas
Thanks for your response Ted. You're right, there was a typo. I changed it, now I'm executing: bin/spark-submit --master spark://localhost:7077 --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=env/dev/log4j-driver.properties" --conf

Blog: Better Feature Engineering with Spark, Solr, and Lucene Analyzers

2016-04-13 Thread Steve Rowe
FYI, I wrote functionality to enable Lucene text analysis components to be used to extract text features via a transformer in spark.ml pipelines. Non-machine-learning uses supported too. See my blog describing the capabilities, which are included in the open-source spark-solr project:

Re: Silly question...

2016-04-13 Thread Michael Segel
Mich, are you building your own releases from the source? Which version of Scala? Again, the builds seem to be OK and working, but I don’t want to hit some ‘gotcha’ if I can avoid it. > On Apr 13, 2016, at 7:15 AM, Mich Talebzadeh > wrote: > > Hi, > > I am

Re: Streaming WriteAheadLogBasedBlockHandler disallows parallelism via StorageLevel replication factor

2016-04-13 Thread Ted Yu
w.r.t. the effective storage level log, here is the JIRA which introduced it: [SPARK-4671][Streaming]Do not replicate streaming block when WAL is enabled On Wed, Apr 13, 2016 at 7:43 AM, Patrick McGloin wrote: > Hi all, > > If I am using a Custom Receiver with

EMR Spark log4j and metrics

2016-04-13 Thread Peter Halliday
I have an existing cluster that I stand up via Docker images and CloudFormation Templates on AWS. We are moving to an EMR and AWS Data Pipeline process, and are having problems with metrics and log4j. We’ve sent a JSON configuration for spark-log4j and spark-metrics. The log4j file seems to be

Re: Spark accessing secured HDFS

2016-04-13 Thread vijikarthi
Looks like the support does not exist, unless someone can say otherwise, and there is an open JIRA: https://issues.apache.org/jira/browse/SPARK-12909 -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-acessing-secured-HDFS-tp26766p26778.html Sent from the Apache

Re: Logging in executors

2016-04-13 Thread Ted Yu
bq. --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=env/dev/log4j-driver.properties" I think the above may have a typo: you refer to log4j-driver.properties in both arguments. FYI On Wed, Apr 13, 2016 at 8:09 AM, Carlos Rojas Matas wrote: > Hi guys, > >

Logging in executors

2016-04-13 Thread Carlos Rojas Matas
Hi guys, I'm trying to enable logging in the executors, but with no luck. According to the official documentation and several blogs, this should be done by passing "spark.executor.extraJavaOpts=-Dlog4j.configuration=[my-file]" to the spark-submit tool. I've tried both sending a reference to a

Streaming WriteAheadLogBasedBlockHandler disallows parallelism via StorageLevel replication factor

2016-04-13 Thread Patrick McGloin
Hi all, If I am using a Custom Receiver with the storage level set to StorageLevel.MEMORY_ONLY_SER_2 and the WAL enabled, I get this warning in the logs: 16/04/13 14:03:15 WARN WriteAheadLogBasedBlockHandler: Storage level replication 2 is unnecessary when write ahead log is enabled, change to
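For context, a minimal sketch of the setup that triggers the warning (the receiver class and its internals are illustrative only):

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

// Illustrative receiver declaring replication-2 storage. With the receiver
// write-ahead log enabled, Spark downgrades the effective replication to 1
// and emits the warning quoted above.
class MyReceiver extends Receiver[String](StorageLevel.MEMORY_ONLY_SER_2) {
  def onStart(): Unit = { /* open the source and call store(...) from a worker thread */ }
  def onStop(): Unit = { /* release resources */ }
}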

Re: Silly question...

2016-04-13 Thread Mich Talebzadeh
Hi, I am not sure this helps. We use Spark 1.6 and Hive 2. I also use JDBC (beeline for Hive) plus Oracle and Sybase. They all work fine. HTH Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

Spark fileStream from a partitioned hive dir

2016-04-13 Thread Daniel Haviv
Hi, We have a Hive table which gets data written to it under two partition keys, day and hour. We would like to stream the incoming files, but since fileStream can only listen on one directory, we start a streaming job on the latest partition and every hour kill it and start a new one on a newer
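A rough sketch of the hourly-restart workaround described (the paths, batch interval, and table layout are hypothetical, and an existing SparkContext sc is assumed):

import org.apache.spark.streaming.{Seconds, StreamingContext}

// Point the stream at the newest hour partition only; when a new hour starts,
// this job is stopped and a fresh one is launched on the newer directory.
val ssc = new StreamingContext(sc, Seconds(60))
val latestPartition = "hdfs:///warehouse/events/day=2016-04-13/hour=14"
val lines = ssc.textFileStream(latestPartition)
lines.foreachRDD(rdd => rdd.take(10).foreach(println))
ssc.start()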

Strange bug: Filter problem with parenthesis

2016-04-13 Thread Saif.A.Ellafi
Hi, I am debugging a program, and for some reason a line calling the following is failing: df.filter("sum(OpenAccounts) > 5").show It says it cannot find the column OpenAccounts, as if it were applying the sum() function and looking for a column with that name, which does not exist. This

回复: build/sbt gen-idea error

2016-04-13 Thread ImMr.K
Actually, the same error occurred when I ran build/sbt compile or other commands. After struggling for some time, I remembered that I use a proxy to connect to the Internet. After setting the proxy for Maven, everything seems OK. Just a reminder for those who use proxies. -- Best regards, Ze Jin

Re: Please assist: Spark 1.5.2 / cannot find StateSpec / State

2016-04-13 Thread Matthias Niehoff
StateSpec and the mapWithState method are only available in Spark 1.6.x. 2016-04-13 11:34 GMT+02:00 Marco Mistroni : > hi all > i am trying to replicate the Streaming Wordcount example described here > > >
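For reference, a minimal sketch of the 1.6.x API in question (the running word count and the wordPairs stream, assumed to be a DStream[(String, Int)], are hypothetical):

import org.apache.spark.streaming.{State, StateSpec}

// Requires spark-streaming 1.6.x; StateSpec and State do not exist in 1.5.2.
def updateCount(word: String, one: Option[Int], state: State[Int]): (String, Int) = {
  val newCount = state.getOption.getOrElse(0) + one.getOrElse(0)
  state.update(newCount)
  (word, newCount)
}

val stateCounts = wordPairs.mapWithState(StateSpec.function(updateCount _))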

Problem with History Server

2016-04-13 Thread alvarobrandon
Hello: I'm using the history server to keep track of the applications I run in my cluster. I'm using Spark with YARN. When I run an application it finishes correctly, and even YARN says that it finished. This is the result of the YARN Resource Manager API: {u'app': [{u'runningContainers': -1,

Performance Tuning | Shark 0.9.1 with Spark 1.0.2

2016-04-13 Thread N, Manjunath 3. (EXT - IN/Noida)
Hi, I am trying to improve the query performance, and I am not sure how to go about this in Shark/Spark. Here is my problem: when I execute a query it is run twice; here is a summary. First FileSink's runJob is executed, and next mapPartitions. 1. FileSink uses only one job always is

Re: S3n performance (@AaronDavidson)

2016-04-13 Thread Gourav Sengupta
Hi, I stopped working with s3n a long time ago. If you are working with Parquet and writing files, s3a is the only alternative that avoids failures. Otherwise, why not just use s3://? Regards, Gourav On Wed, Apr 13, 2016 at 12:17 PM, Steve Loughran wrote: > > On 12

Re: S3n performance (@AaronDavidson)

2016-04-13 Thread Steve Loughran
On 12 Apr 2016, at 22:05, Martin Eden wrote: Hi everyone, Running on EMR 4.3 with Spark 1.6.0 and the provided S3N native driver, I manage to process approx 1TB of strings inside gzipped parquet in about 50 mins on a 20 node cluster (8

RE: ML Random Forest Classifier

2016-04-13 Thread Ashic Mahtab
It looks like all of that is building up to spark 2.0 (for random forests / gbts / etc.). Ah well...thanks for your help. Was interesting digging into the depths. Date: Wed, 13 Apr 2016 09:48:32 +0100 Subject: Re: ML Random Forest Classifier From: ja...@gluru.co To: as...@live.com CC:

Please assist: Spark 1.5.2 / cannot find StateSpec / State

2016-04-13 Thread Marco Mistroni
Hi all, I am trying to replicate the Streaming Wordcount example described here: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/StatefulNetworkWordCount.scala In my build.sbt I have the following dependencies. libraryDependencies +=

Re: spark.driver.extraClassPath and export SPARK_CLASSPATH

2016-04-13 Thread AlexModestov
In "spark-defaults.conf" I wrote:
spark.driver.extraClassPath '/dir'
or:
PYSPARK_DRIVER_PYTHON=ipython PYSPARK_DRIVER_PYTHON_OPTS="notebook" /.../sparkling-water-1.6.1/bin/pysparkling \ --conf spark.driver.extraClassPath='/.../sqljdbc41.jar'
Nothing works. -- View this message in context:

Re: ML Random Forest Classifier

2016-04-13 Thread James Hammerton
Hi Ashic, Unfortunately I don't know how to work around that - I suggested this line as it looked promising (I had considered it once before deciding to use a different algorithm) but I never actually tried it. Regards, James On 13 April 2016 at 02:29, Ashic Mahtab wrote: >

RE: Unable to Access files in Hadoop HA enabled from using Spark

2016-04-13 Thread ashesh_28
Are you running from Eclipse? If so, add the HADOOP_CONF_DIR path to the classpath, and then you can access your HDFS directory as below:
object sparkExample {
  def main(args: Array[String]) {
    val logname = "///user/hduser/input/sample.txt"
    val conf = new

RE: Unable to Access files in Hadoop HA enabled from using Spark

2016-04-13 Thread Amit Hora
Finally I tried setting the configuration manually using sc.hadoopConfiguration.set for dfs.nameservices, dfs.ha.namenodes.hdpha and dfs.namenode.rpc-address.hdpha.n1, and it worked. I don't know why it was not reading these settings from the file under HADOOP_CONF_DIR. -Original Message-
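For anyone hitting the same problem, a sketch of the workaround described above (the nameservice and host names follow the thread: hdpha, hdp231 active, ambarimaster standby; the RPC port 8020, the second namenode id n2, and the failover proxy provider are assumptions):

// Set the HA nameservice properties directly on the SparkContext's Hadoop
// configuration instead of relying on HADOOP_CONF_DIR being picked up.
sc.hadoopConfiguration.set("dfs.nameservices", "hdpha")
sc.hadoopConfiguration.set("dfs.ha.namenodes.hdpha", "n1,n2")
sc.hadoopConfiguration.set("dfs.namenode.rpc-address.hdpha.n1", "hdp231:8020")
sc.hadoopConfiguration.set("dfs.namenode.rpc-address.hdpha.n2", "ambarimaster:8020")
sc.hadoopConfiguration.set("dfs.client.failover.proxy.provider.hdpha",
  "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider")

val sample = sc.textFile("hdfs://hdpha/user/hduser/input/sample.txt")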

RE: Unable to Access files in Hadoop HA enabled from using Spark

2016-04-13 Thread Amit Hora
There are DNS entries for both of my namenodes. Ambarimaster is the standby and it resolves to its IP perfectly; Hdp231 is active and it also resolves to its IP. Hdpha is my Hadoop HA cluster name, and hdfs-site.xml has entries for these configurations. -Original Message- From: "Jörn Franke"

Re: Unable to Access files in Hadoop HA enabled from using Spark

2016-04-13 Thread Jörn Franke
Is the host in /etc/hosts? > On 13 Apr 2016, at 07:28, Amit Singh Hora wrote: > > I am trying to access a directory in Hadoop from my Spark code on the local > machine. Hadoop is HA enabled. > > val conf = new SparkConf().setAppName("LDA Sample").setMaster("local[2]") > val