Unable to run Spark in YARN mode

2016-04-08 Thread maheshmath
I have set SPARK_LOCAL_IP=127.0.0.1 but am still getting the error below: 16/04/09 10:36:50 INFO spark.SecurityManager: Changing view acls to: mahesh 16/04/09 10:36:50 INFO spark.SecurityManager: Changing modify acls to: mahesh 16/04/09 10:36:50 INFO spark.SecurityManager: SecurityManager: authentication

How Spark handles dead machines during a job.

2016-04-08 Thread Sung Hwan Chung
Hello. Say that I'm doing a simple rdd.map followed by collect. Say also that one of the executors finishes all of its tasks, but there are still other executors running. If the machine that hosted the finished executor gets terminated, does the master still have the results from the finished

Re: Monitoring S3 Bucket with Spark Streaming

2016-04-08 Thread Natu Lauchande
Hi Benjamin, I have done it. The critical configuration items are the ones below: ssc.sparkContext.hadoopConfiguration.set("fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem") ssc.sparkContext.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", AccessKeyId)
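A minimal self-contained sketch of what such a job might look like on Spark 1.x streaming, following the configuration items above (bucket path, batch interval, and the environment-variable names for credentials are placeholders; the fs.s3n.awsSecretAccessKey property is assumed alongside the quoted access-key one):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object S3BucketMonitor {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("S3BucketMonitor")
    val ssc = new StreamingContext(conf, Seconds(60))

    // Configure the s3n filesystem and credentials (values are placeholders)
    val hadoopConf = ssc.sparkContext.hadoopConfiguration
    hadoopConf.set("fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
    hadoopConf.set("fs.s3n.awsAccessKeyId", sys.env("AWS_ACCESS_KEY_ID"))
    hadoopConf.set("fs.s3n.awsSecretAccessKey", sys.env("AWS_SECRET_ACCESS_KEY"))

    // Pick up new files that appear under the monitored prefix on each batch
    val lines = ssc.textFileStream("s3n://my-bucket/incoming/")
    lines.count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```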

Unsubscribe

2016-04-08 Thread Db-Blog
Unsubscribe > On 06-Apr-2016, at 5:40 PM, Brian London wrote: > > unsubscribe - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org

Re: Work on Spark engine for Hive

2016-04-08 Thread Szehon Ho
I only know that the latest released CDH version does have Hive (based on 1.2) on Spark 1.6, though admittedly I have not tested the Hive 2.0 branch on that. So I would recommend you try the latest 1.6-based Spark assembly from CDH (the version that we test) to rule out the possibility of building it

Re: Copying all Hive tables from Prod to UAT

2016-04-08 Thread Xiao Li
You also need to ensure no workload is running on either side. 2016-04-08 15:54 GMT-07:00 Ali Gouta: > For Hive, you may use Sqoop to achieve this. In my opinion, you may also > run a Spark job to do it. > On 9 Apr 2016 00:25, "Ashok Kumar"

Re: Copying all Hive tables from Prod to UAT

2016-04-08 Thread Ali Gouta
For Hive, you may use Sqoop to achieve this. In my opinion, you may also run a Spark job to do it. On 9 Apr 2016 00:25, "Ashok Kumar" wrote: Hi, does anyone have suggestions on how to create and copy Hive and Spark tables from Production to UAT? One way would be to

Copying all Hive tables from Prod to UAT

2016-04-08 Thread Ashok Kumar
Hi, does anyone have suggestions on how to create and copy Hive and Spark tables from Production to UAT? One way would be to copy the table data to external files, then move the external files to a local target directory and populate the tables in the target Hive with that data. Is there an easier way of doing
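One hedged sketch of the export-to-files route described above, using a HiveContext on Spark 1.x (database, table, and paths are hypothetical; moving the exported files between the Prod and UAT clusters is assumed to happen out of band, e.g. with distcp):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object CopyHiveTable {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("CopyHiveTable"))
    val hiveContext = new HiveContext(sc)

    // On the Production cluster: dump the table to external files
    val prodDf = hiveContext.table("sales.transactions")
    prodDf.write.parquet("hdfs:///staging/export/sales_transactions")

    // After moving the files to the UAT cluster (e.g. via distcp),
    // read them back there and repopulate the target Hive table
    val staged = hiveContext.read.parquet("hdfs:///staging/import/sales_transactions")
    staged.write.saveAsTable("sales.transactions")
  }
}
```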

Re: Need clarification regd deploy-mode client

2016-04-08 Thread bdev
Thanks Mandar for the clarification.

Re: DataFrame job fails on parsing error, help?

2016-04-08 Thread Ted Yu
Not much. So no chance of a different snappy version? On Fri, Apr 8, 2016 at 1:26 PM, Nicolas Tilmans wrote: > Hi Ted, > > The Spark version is 1.6.1; a nearly identical set of operations does fine > on smaller datasets. It's just a few joins then a groupBy and a count in >

Re: DataFrame job fails on parsing error, help?

2016-04-08 Thread Ted Yu
Did you encounter a similar error on a smaller dataset? Which release of Spark are you using? Is it possible you have an incompatible snappy version somewhere in your classpath? Thanks On Fri, Apr 8, 2016 at 12:36 PM, entee wrote: > I'm trying to do a relatively large

Re: Work on Spark engine for Hive

2016-04-08 Thread Mich Talebzadeh
The fundamental problem seems to be the spark-assembly-n.n.n-hadoopn.n.n.jar libraries that are incompatible and cause issues. For example, Hive does not work with the existing Spark 1.6.1 binaries. In other words, if you set hive.execution.engine in the following $HIVE_HOME/conf/hive-site.xml

DataFrame job fails on parsing error, help?

2016-04-08 Thread entee
I'm trying to do a relatively large join (0.5TB shuffle read/write) and just calling a count (or show) on the dataframe fails to complete, getting to the last task before failing: Py4JJavaError: An error occurred while calling o133.count. : org.apache.spark.SparkException: Job aborted due to

Re: ordering over structs

2016-04-08 Thread Michael Armbrust
You need to use the struct function (which creates an actual struct); you are trying to use the struct datatype (which just represents the schema of a struct). On Thu, Apr 7, 2016 at 3:48 PM, Imran
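A small illustration of that distinction, assuming Spark 1.6 and a hypothetical DataFrame with key, ts, and value columns: the struct function builds a struct-valued column whose ordering is lexicographic over its fields, so it can be used directly inside max to get a latest-row-per-key result.

```scala
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.{struct, max}

// Assumes an existing SQLContext; the data and column names are placeholders
def latestPerKey(sqlContext: SQLContext): Unit = {
  import sqlContext.implicits._

  val df = Seq(("a", 1L, "old"), ("a", 2L, "new"), ("b", 5L, "only"))
    .toDF("key", "ts", "value")

  // struct(...) the *function* builds a struct column; ordering over it is
  // lexicographic on its fields, so max picks the row with the largest ts
  val latest = df.groupBy($"key")
    .agg(max(struct($"ts", $"value")).as("latest"))
    .select($"key", $"latest.ts", $"latest.value")

  latest.show()
}
```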

Re: Spark Streaming - NotSerializableException: Methods & Closures:

2016-04-08 Thread jamborta
You could also try to put transform in a companion object. On Fri, 8 Apr 2016 16:48 mpawashe [via Apache Spark User List], < ml-node+s1001560n26718...@n3.nabble.com> wrote: > The class declaration is already marked Serializable ("with Serializable") > > -- > If you
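A hedged sketch of that suggestion (the class and method names are hypothetical): moving the transformation into a companion object means the closure only captures a standalone function, not the enclosing, possibly non-serializable, streaming class.

```scala
import org.apache.spark.streaming.dstream.DStream

class MyStreamingJob {
  // The DStream operation refers to a function on the companion object,
  // so the MyStreamingJob instance itself is not pulled into the closure
  def process(lines: DStream[String]): DStream[String] =
    lines.map(MyStreamingJob.transform)
}

object MyStreamingJob {
  // Pure function with no reference to the enclosing class instance
  def transform(line: String): String = line.trim.toLowerCase
}
```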

Monitoring S3 Bucket with Spark Streaming

2016-04-08 Thread Benjamin Kim
Has anyone monitored an S3 bucket or directory using Spark Streaming and pulled any new files to process? If so, can you provide basic Scala coding help on this? Thanks, Ben - To unsubscribe, e-mail:

Handling of slow elements in dstream's processing

2016-04-08 Thread OmkarP
Problem statement: I am building a somewhat time-critical application that is supposed to receive messages on a stream (ZMQ) and then operate on each of the data points that come in on the stream. The caveat is that some data points may need more time for processing than most others, since the

Re: Spark Streaming share state between two streams

2016-04-08 Thread Rishi Mishra
Hi Shekhar, As both of your state functions do the same thing, can't you do a union of the DStreams before applying mapWithState()? It might be difficult if one state function is dependent on the other's state. This requires a named state, which can be accessed in other state functions. I have not gone
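A rough sketch of the union-then-mapWithState idea on Spark 1.6 (stream names, key/value types, and the state-update logic are placeholders):

```scala
import org.apache.spark.streaming.{State, StateSpec}
import org.apache.spark.streaming.dstream.DStream

object SharedState {
  // One state function applied to the union of both streams:
  // keeps a running count per key and emits (key, countSoFar)
  def updateCount(key: String, value: Option[Int], state: State[Long]): (String, Long) = {
    val newCount = state.getOption.getOrElse(0L) + value.getOrElse(0)
    state.update(newCount)
    (key, newCount)
  }

  def enriched(stream1: DStream[(String, Int)],
               stream2: DStream[(String, Int)]): DStream[(String, Long)] = {
    // Union the two DStreams first, then maintain a single shared state
    stream1.union(stream2).mapWithState(StateSpec.function(updateCount _))
  }
}
```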

Re: Work on Spark engine for Hive

2016-04-08 Thread Szehon Ho
Yes, that is a good goal we will have to do eventually. I was not aware that it is not working to be honest. Can you let us know what is broken on Hive 2 on Spark 1.6.1? Preferably via filing a JIRA on HIVE side? On Fri, Apr 8, 2016 at 7:47 AM, Mich Talebzadeh

What is the best way to process streaming data from multiple channels simultaneously using Spark 2.0 API's?

2016-04-08 Thread imax
I’d like to use the Spark 2.0 (streaming) API to consume data from a custom data source that provides an API for random access to a stream of data, represented as a “topic” that has a collection of partitions that might be accessed/consumed simultaneously. I want to implement a streaming process using

Re: How to configure parquet.block.size on Spark 1.6

2016-04-08 Thread nihed mbarek
I can't write to the hadoopConfiguration in Java. On Friday, 8 April 2016, Silvio Fiorito wrote: > Have you tried: > > sc.hadoopConfiguration.setLong(parquet.hadoop.ParquetOutputFormat.BLOCK_SIZE, > N * 1024 * 1024) > > Not sure if it'd work or not, but since it's getting
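For reference, a hedged Scala sketch of the suggestion quoted above for Spark 1.6 (the 256 MB value and output path are only examples; whether Parquet honours it depends on how the writer picks up the Hadoop configuration, which is the open question in this thread):

```scala
import org.apache.spark.sql.SQLContext

// Assumes an existing SQLContext; the row-group size and path are examples
def writeWithBlockSize(sqlContext: SQLContext, path: String): Unit = {
  val df = sqlContext.range(0, 1000000)

  // parquet.block.size is the Parquet row-group size in bytes (256 MB here)
  sqlContext.sparkContext.hadoopConfiguration
    .setLong("parquet.block.size", 256L * 1024 * 1024)

  df.write.parquet(path)
}
```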

Need clarification regd deploy-mode client

2016-04-08 Thread bdev
I'm running pyspark with deploy-mode as client with yarn using dynamic allocation: pyspark --master yarn --deploy-mode client --executor-memory 6g --executor-cores 4 --driver-memory 4g The node where I'm running pyspark has 4GB memory but I keep running out of memory on this node. If using yarn,

Re: How to configure parquet.block.size on Spark 1.6

2016-04-08 Thread Ted Yu
I searched 1.6.1 code base but didn't find how this can be configured (within Spark). On Fri, Apr 8, 2016 at 9:01 AM, nihed mbarek wrote: > Hi > How to configure parquet.block.size on Spark 1.6 ? > > Thank you > Nihed MBAREK > > > -- > > M'BAREK Med Nihed, > Fedora

How to configure parquet.block.size on Spark 1.6

2016-04-08 Thread nihed mbarek
Hi How to configure parquet.block.size on Spark 1.6 ? Thank you Nihed MBAREK -- M'BAREK Med Nihed, Fedora Ambassador, TUNISIA, Northern Africa http://www.nihed.com

Re: Spark Streaming - NotSerializableException: Methods & Closures:

2016-04-08 Thread mpawashe
The class declaration is already marked Serializable ("with Serializable").

Good Book on Hadoop Interview

2016-04-08 Thread Chaturvedi Chola
A good book on big data interview FAQs: http://www.amazon.in/Big-Data-Interview-FAQs-Chinnasamy/dp/9386009188/ https://notionpress.com/read/big-data-interview-faqs

Re: Running Spark on Yarn-Client/Cluster mode

2016-04-08 Thread ashesh_28
Hi Dhiraj, Thanks for the clarification. Yes, I did check that both YARN-related daemons (NodeManager & ResourceManager) are running on their respective nodes, and I can access the HDFS directory structure from each node. I am using Hadoop version 2.7.2 and I have downloaded the pre-built version

Re: Work on Spark engine for Hive

2016-04-08 Thread Mich Talebzadeh
This is a different thing. The question is: when will Hive 2 be able to run on the installed Spark 1.6.1 binaries as its execution engine? Dr Mich Talebzadeh

Re: Sqoop on Spark

2016-04-08 Thread Mich Talebzadeh
Well, unless you have plenty of memory, you are going to have certain issues with Spark. I tried to load a billion-row table from Oracle through Spark using JDBC and ended up with a "Caused by: java.lang.OutOfMemoryError: Java heap space" error. Sqoop uses MapReduce and does it in serial mode, which
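If the JDBC route is attempted anyway, the read can at least be split across executors rather than pulled through a single connection; this partitioned-read option is not from the thread itself, just a hedged Spark 1.x sketch (connection URL, table, credentials, and bounds are placeholders, and the Oracle driver must be on the classpath):

```scala
import org.apache.spark.sql.{DataFrame, SQLContext}

// Assumes an existing SQLContext; all connection details are placeholders
def loadBigTable(sqlContext: SQLContext): DataFrame = {
  sqlContext.read.format("jdbc").options(Map(
    "url" -> "jdbc:oracle:thin:@//dbhost:1521/ORCL",
    "dbtable" -> "BIG_TABLE",
    "user" -> "scott",
    "password" -> "tiger",
    "partitionColumn" -> "ID",   // numeric column used to split the read
    "lowerBound" -> "1",
    "upperBound" -> "1000000000",
    "numPartitions" -> "100"     // 100 parallel reads instead of one big one
  )).load()
}
```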

How to import D3 library in Spark

2016-04-08 Thread Chadha Pooja
Hi, I am using Amazon EMR for running Spark and would like to reproduce something similar to the graph at the end of this link: https://docs.cloud.databricks.com/docs/latest/featured_notebooks/Wikipedia%20Clickstream%20Data.html Can someone help me with how to import the d3 library in Spark? I am

Streaming k-means visualization

2016-04-08 Thread Priya Ch
Hi All, I am using streaming k-means to train my model on streaming data. Now I want to visualize the clusters. What reporting tool would be used for this? Could Zeppelin be used to visualize the clusters? Regards, Padma Ch

Re: can not join dataset with itself

2016-04-08 Thread JH P
I'm using Spark 1.6.1. The class is case class DistinctValues(statType: Int, dataType: Int, _id: Int, values: Array[(String, Long)], numOfMembers: Int, category: String) and the error for newGnsDS.joinWith(newGnsDS, $"dataType") is Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot

Re: can not join dataset with itself

2016-04-08 Thread Ted Yu
Looks like you're using Spark 1.6.x. What error(s) did you get for the first two joins? Thanks On Fri, Apr 8, 2016 at 3:53 AM, JH P wrote: > Hi. I want to join a dataset with itself, so I tried the code below. > > 1. newGnsDS.joinWith(newGnsDS, $"dataType") > > 2.

Re: Running Spark on Yarn-Client/Cluster mode

2016-04-08 Thread ashesh_28
Some more information on node memory and cores: ptfhadoop01v - 4GB, ntpcam01v - 1GB, ntpcam03v - 2GB. Each of the VMs has only a single-core CPU.

Re: [HELP:]Save Spark Dataframe in Phoenix Table

2016-04-08 Thread Josh Mahonin
Hi Divya, That's strange. Are you able to post a snippet of your code to look at? And are you sure that you're saving the dataframes as per the docs ( https://phoenix.apache.org/phoenix_spark.html)? Depending on your HDP version, it may or may not actually have phoenix-spark support.
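For comparison, the save path documented for phoenix-spark on Spark 1.x looks roughly like the sketch below (table name and ZooKeeper URL are placeholders; the exact options available depend on the Phoenix version bundled with the HDP release, as noted above):

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}

// Assumes a DataFrame whose columns match the target Phoenix table's schema
def saveToPhoenix(df: DataFrame): Unit = {
  df.write
    .format("org.apache.phoenix.spark")
    .mode(SaveMode.Overwrite)
    .option("table", "OUTPUT_TABLE")   // placeholder Phoenix table name
    .option("zkUrl", "zkhost:2181")    // placeholder ZooKeeper quorum
    .save()
}
```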

can not join dataset with itself

2016-04-08 Thread JH P
Hi. I want to join a dataset with itself, so I tried the code below. 1. newGnsDS.joinWith(newGnsDS, $"dataType") 2. newGnsDS.as("a").joinWith(newGnsDS.as("b"), $"a.dataType" === $"b.datatype") 3. val a = newGnsDS.map(x => x).as("a") val b = newGnsDS.map(x => x).as("b") a.joinWith(b,

Spark Streaming share state between two streams

2016-04-08 Thread Shekhar Bansal
Hi, can we share Spark Streaming state between two DStreams? Basically I want to create state using the first stream and enrich the second stream using that state. Example: I have modified the StatefulNetworkWordCount example. I am creating state using the first stream and enriching the second stream with the count of the first

Re: About nested RDD

2016-04-08 Thread Rishi Mishra
As mentioned earlier, you can create a broadcast variable containing all of the small RDD's elements. I hope they are really small. Then you can call A.update(broadcastVariable). Regards, Rishitesh Mishra, SnappyData (http://www.snappydata.io/) https://in.linkedin.com/in/rishiteshmishra On Fri,
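A rough sketch of that broadcast pattern (the RDDs and the "update" logic are hypothetical; A.update in the thread is the poster's own method, so here the broadcast is simply consumed inside a map on the large RDD):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object BroadcastSmallRdd {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("BroadcastSmallRdd"))

    val bigRdd = sc.parallelize(1 to 1000000)      // stands in for the large RDD A
    val smallRdd = sc.parallelize(Seq(1, 5, 42))   // stands in for a small RDD b

    // Materialise the small RDD on the driver and broadcast it, so its
    // elements can be used inside transformations of the big RDD
    val smallSet = sc.broadcast(smallRdd.collect().toSet)

    // Example "update": tag elements of A that also appear in b
    val updated = bigRdd.map(x => (x, smallSet.value.contains(x)))
    println(updated.filter(_._2).count())
  }
}
```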

Re: Sqoop on Spark

2016-04-08 Thread Gourav Sengupta
Hi, some metrics thrown around in the discussion: SQOOP: extract 500 million rows (single thread) in 20 mins (data size 21 GB). SPARK: load the data into memory, 15 mins. SPARK: use JDBC (which, similar to SQOOP, is difficult to parallelize) to load 500 million records - manually killed after 8 hours.

Re: Running Spark on Yarn-Client/Cluster mode

2016-04-08 Thread ashesh_28
Hi, just a quick update. After trying for a while, I rebooted all three machines used in the cluster and formatted the NameNode and ZKFC. Then I started every daemon in the cluster. After all the daemons were up and running, I tried to issue the same command as earlier

Re: About nested RDD

2016-04-08 Thread Tenghuan He
Hi, thanks for your reply. Yes, it's very much like the union() method, but there is some difference. I have a very large RDD A and a lot of small RDDs b, c, d and so on, and A.update(a) will update some elements in A and return a new RDD when calling val B =

Re: Why do I need to handle dependencies on EMR but not on-prem Hadoop?

2016-04-08 Thread Sean Owen
You probably just got lucky, and the default Python distribution on your CDH nodes has this library but the EMR one doesn't. (CDH actually has an Anaconda distribution, not sure if you enabled that.) In general you need to make dependencies available that your app does not supply. On Fri, Apr 8,

Work on Spark engine for Hive

2016-04-08 Thread Mich Talebzadeh
Hi, is there any scheduled work to enable Hive to use a recent version of the Spark engine? This is becoming an issue, as some applications have to rely on the MapReduce engine to do operations on Hive 2, which is serial and slow. Thanks, Dr Mich Talebzadeh

Re: May I ask a question about SparkSql

2016-04-08 Thread Mich Talebzadeh
Hi Jackie, Can you create a DataFrame from the RDD and register it as a temp table? This should work, although I have not tried it. HTH, Dr Mich Talebzadeh
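A minimal sketch of that suggestion for Spark 1.x (the case class, the RDD, and the query are placeholders): convert each RDD to a DataFrame, register it as a temp table, and run SQL against it, e.g. from inside foreachRDD of a DStream.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SQLContext

case class Event(user: String, amount: Double)   // hypothetical record type

// Assumes an existing SQLContext and an RDD of Event (e.g. from foreachRDD)
def queryRdd(sqlContext: SQLContext, events: RDD[Event]): Unit = {
  import sqlContext.implicits._

  // Turn the RDD into a DataFrame and expose it to SQL as a temp table
  val df = events.toDF()
  df.registerTempTable("events")

  sqlContext.sql("SELECT user, SUM(amount) AS total FROM events GROUP BY user").show()
}
```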

Re: May I ask a question about SparkSql

2016-04-08 Thread Kasinathan, Prabhu
Check this one: https://github.com/Intel-bigdata/spark-streamingsql. We tried it and it was working with Spark 1.3.1. You can do ETL on a Spark StreamingContext using Spark SQL. Thanks, Prabhu From: Hustjackie > Reply-To:

Re: About nested RDD

2016-04-08 Thread Rishi Mishra
rdd.count() is a fairly straightforward operation which can be calculated on the driver, and then the value can be included in the map function. Is your goal to write a generic function which operates on two RDDs, with one RDD being evaluated for each partition of the other? Here also you can use
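A tiny sketch of the first suggestion (the RDD contents are arbitrary): compute the count on the driver, then close over the resulting plain value in the map, rather than referring to the other RDD inside the closure.

```scala
import org.apache.spark.SparkContext

// Assumes an existing SparkContext; the data is only illustrative
def normalise(sc: SparkContext): Unit = {
  val small = sc.parallelize(Seq(1, 2, 3, 4))
  val big = sc.parallelize(1 to 100)

  // count() runs on the driver and yields a plain Long...
  val n = small.count()

  // ...which the closure can capture without nesting RDDs
  val scaled = big.map(x => x.toDouble / n)
  println(scaled.take(5).mkString(", "))
}
```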

May I ask a question about SparkSql

2016-04-08 Thread Hustjackie
Hi all, I have several jobs running with Spark Streaming, but I would prefer to run some SQL to do the same things. So does Spark SQL support real-time jobs; in other words, does Spark support Spark Streaming SQL? Thanks in advance, any help is appreciated. Jacky

Re: About nested RDD

2016-04-08 Thread Holden Karau
It seems like the union function on RDDs might be what you are looking for, or was there something else you were trying to achieve? On Thursday, April 7, 2016, Tenghuan He wrote: > Hi all, > > I know that nested RDDs are not possible, like rdd1.map(x => x + >

Why do I need to handle dependencies on EMR but not on-prem Hadoop?

2016-04-08 Thread YaoPau
On-prem I'm running PySpark on Cloudera's distribution, and I've never had to worry about dependency issues. I import my libraries on my driver node only using pip or conda, run my jobs in yarn-client mode, and everything works (I just assumed the relevant libraries are copied temporarily to each

Re: MLlib ALS MatrixFactorizationModel.save fails consistently

2016-04-08 Thread Nick Pentreath
Could you post some stack trace info? Generally, it can be problematic to run Spark within a web server framework as often there are dependency conflict and threading issues. You might prefer to run the model-building as a standalone app, or check out

Re: [ERROR]: Spark 1.5.2 + Hbase 1.1 + Hive 1.2 + HbaseIntegration

2016-04-08 Thread Wojciech Indyk
Hello Divya! Have you solved the problem? I suppose the log comes from the driver. You also need to look at the logs on the worker JVMs; there can be an exception or something there. Do you have Kerberos on your cluster? It could be similar to this problem: http://issues.apache.org/jira/browse/SPARK-14115 Based on

how to use udf in spark thrift server.

2016-04-08 Thread zhanghn
I want to define some UDFs in my Spark environment and serve them in the thrift server, so I can use these UDFs in my beeline connection. At first I tried starting it with UDF jars and creating functions in Hive. In spark-sql, I can add temp functions like "CREATE TEMPORARY FUNCTION bsdUpper AS
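One alternative approach sometimes used for this, offered here only as a hedged sketch (the UDF name and logic are placeholders, and it may not match the poster's deployment): register the UDF programmatically on a HiveContext and start the thrift server with that same context via HiveThriftServer2.startWithContext, so beeline sessions connecting to it can see the registered function.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

object ThriftServerWithUdf {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("ThriftServerWithUdf"))
    val hiveContext = new HiveContext(sc)

    // Register the UDF on the context that will back the thrift server
    hiveContext.udf.register("bsdUpper", (s: String) => s.toUpperCase)

    // Start the JDBC/ODBC server with this context, so beeline sessions
    // connecting to it can call bsdUpper() directly
    HiveThriftServer2.startWithContext(hiveContext)
  }
}
```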