Re: seeing this message repeatedly.

2016-09-03 Thread kant kodali
I don't think my driver program, which is running on my local machine, can connect to the worker/executor machines, because the Spark UI lists private IPs for the worker machines. I can connect to the master from the driver because of this setting: export SPARK_PUBLIC_DNS="52.44.36.224". I'm really not sure how to fix
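For reference, a minimal spark-env.sh sketch for the worker nodes along the lines being discussed; the values are placeholders and this is not a verified fix for this particular cluster:

    # spark-env.sh on each worker node -- hypothetical values, adjust per node
    export SPARK_PUBLIC_DNS="<worker's public DNS or Elastic IP>"   # address advertised to the driver and shown in the UI
    export SPARK_LOCAL_IP="<worker's private IP>"                   # address Spark binds to inside the VPC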

Re: Passing Custom App Id for consumption in History Server

2016-09-03 Thread ayan guha
How about this: 1. Create a primary key in your custom system. 2. Schedule the job with the custom primary key as the job name. 3. After setting up the Spark context (inside the job), get the application id. Then save the mapping of App Name & App Id from the Spark job to your custom database, through
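A minimal Scala sketch of steps 2-3, assuming the custom primary key is passed in as the job name; saveMapping() and CUSTOM_JOB_KEY are hypothetical stand-ins for the write to the custom database and for however the key reaches the job:

    import org.apache.spark.{SparkConf, SparkContext}

    // Hypothetical helper: replace with the actual write to the custom system.
    def saveMapping(jobName: String, appId: String): Unit =
      println(s"mapping $jobName -> $appId")

    val jobName = sys.env.getOrElse("CUSTOM_JOB_KEY", "my-primary-key")  // assumed way of passing the key
    val sc = new SparkContext(new SparkConf().setAppName(jobName))
    saveMapping(jobName, sc.applicationId)  // applicationId is assigned once the context is up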

seeing this message repeatedly.

2016-09-03 Thread kant kodali
Hi guys, I am running my driver program on my local machine and my Spark cluster is on AWS. The big question is that I don't know the right settings to get around this public/private IP issue on AWS. My spark-env.sh currently has the following lines: export

Re: any idea what this error could be?

2016-09-03 Thread Fridtjof Sander
I see. The default Scala version changed to 2.11 with Spark 2.0.0 afaik, so that's probably the version you get when downloading prepackaged binaries. Glad I could help ;) On 3 September 2016 23:59:51 CEST, kant kodali wrote: >@Fridtjof you are right! >changing it to

Re: any idea what this error could be?

2016-09-03 Thread kant kodali
@Fridtjof you are right! Changing it to this fixed it! compile group: 'org.apache.spark' name: 'spark-core_2.11' version: '2.0.0' compile group: 'org.apache.spark' name: 'spark-streaming_2.11' version: '2.0.0' On Sat, Sep 3, 2016 12:30 PM, kant kodali kanth...@gmail.com wrote: I increased the

Creating RDD using swebhdfs with truststore

2016-09-03 Thread Sourav Mazumder
Hi, I am trying to create an RDD by using swebhdfs to a remote Hadoop cluster which is protected by Knox and uses SSL. The code looks like this: sc.textFile("swebhdfs:/host:port/gateway/default/webhdfs/v1/").count. I'm passing the truststore and truststore password through extra Java options
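One commonly used way to pass a truststore through extra Java options on both the driver and the executors; the path and password below are placeholders, and whether the swebhdfs/Knox client honours them depends on the Hadoop client configuration:

    spark-shell \
      --conf "spark.driver.extraJavaOptions=-Djavax.net.ssl.trustStore=/path/to/truststore.jks -Djavax.net.ssl.trustStorePassword=changeit" \
      --conf "spark.executor.extraJavaOptions=-Djavax.net.ssl.trustStore=/path/to/truststore.jks -Djavax.net.ssl.trustStorePassword=changeit"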

Re: Spark SQL Tables on top of HBase Tables

2016-09-03 Thread Mich Talebzadeh
Mine is HBase 0.98. Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw * http://talebzadehmich.wordpress.com *Disclaimer:* Use it at your own

Re: Spark SQL Tables on top of HBase Tables

2016-09-03 Thread Benjamin Kim
I’m using Spark 1.6 and HBase 1.2. Have you got it to work using these versions? > On Sep 3, 2016, at 12:49 PM, Mich Talebzadeh > wrote: > > I am trying to find a solution for this > > ERROR log: error in initSerDe: java.lang.ClassNotFoundException Class >

Re: Spark SQL Tables on top of HBase Tables

2016-09-03 Thread Mich Talebzadeh
I am trying to find a solution for this ERROR in the log: error in initSerDe: java.lang.ClassNotFoundException: Class org.apache.hadoop.hive.hbase.HBaseSerDe not found. I am using Spark 2 and Hive 2! HTH Dr Mich Talebzadeh LinkedIn *

Re: Spark SQL Tables on top of HBase Tables

2016-09-03 Thread Benjamin Kim
Mich, I’m in the same boat. We can use Hive but not Spark. Cheers, Ben > On Sep 2, 2016, at 3:37 PM, Mich Talebzadeh wrote: > > Hi, > > You can create Hive external tables on top of existing Hbase table using the > property > > STORED BY
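For context, a sketch of the STORED BY approach Mich refers to; the table and column names here are made up, and the HBase storage handler/serde jars must be on Hive's classpath, which is exactly what the ClassNotFoundException above complains about:

    CREATE EXTERNAL TABLE hbase_orders (rowkey STRING, amount DOUBLE)
    STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf:amount")
    TBLPROPERTIES ("hbase.table.name" = "orders");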

Re: any idea what this error could be?

2016-09-03 Thread kant kodali
I increased the memory but nothing has changed; I still get the same error. @Fridtjof, on my driver side I am using the following dependencies: compile group: 'org.apache.spark' name: 'spark-core_2.10' version: '2.0.0' compile group: 'org.apache.spark' name: 'spark-streaming_2.10' version: '2.0.0' on

Re: Help with Jupyter Notebook Setup on CDH using Anaconda

2016-09-03 Thread Marco Mistroni
Hi, please paste the exception. For Spark vs Jupyter, you might want to sign up for this; it'll give you Jupyter and Spark, and presumably spark-csv is already part of it? https://community.cloud.databricks.com/login.html HTH, Marco. On Sat, Sep 3, 2016 at 8:10 PM, Arif,Mubaraka

Help with Jupyter Notebook Setup on CDH using Anaconda

2016-09-03 Thread Arif,Mubaraka
On the on-premise Cloudera Hadoop 5.7.2 cluster I have installed the Anaconda package and am trying to set up a Jupyter notebook to work with Spark 1.6. I have run into problems when trying to use the package com.databricks:spark-csv_2.10:1.4.0 for reading and inferring the schema of the CSV file using
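A commonly used way to put Jupyter in front of PySpark 1.6 with the spark-csv package; the Anaconda parcel path and port are assumptions about this particular CDH install:

    export PYSPARK_DRIVER_PYTHON=/opt/cloudera/parcels/Anaconda/bin/jupyter   # assumed parcel path
    export PYSPARK_DRIVER_PYTHON_OPTS="notebook --no-browser --port=8888"
    pyspark --packages com.databricks:spark-csv_2.10:1.4.0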

Re: Pls assist: Spark 2.0 build failure on Ubuntu 16.06

2016-09-03 Thread Diwakar Dhanuskodi
Please run with -X and post the logs here; we can get the exact error from them. On Sat, Sep 3, 2016 at 7:24 PM, Marco Mistroni wrote: > hi all > > i am getting failures when building spark 2.0 on Ubuntu 16.06 > Here's details of what i have installed on the ubuntu host > - java 8

Re: Re[6]: Spark 2.0: SQL runs 5x times slower when adding 29th field to aggregation.

2016-09-03 Thread Gavin Yue
Any shuffling? > On Sep 3, 2016, at 5:50 AM, Сергей Романов wrote: > > Same problem happens with CSV data file, so it's not parquet-related either. > > [Spark welcome banner]

Re: Catalog, SessionCatalog and ExternalCatalog in spark 2.0

2016-09-03 Thread Kapil Malik
Thanks Raghavendra :) Will look into Analyzer as well. Kapil Malik *Sr. Principal Engineer | Data Platform, Technology* M: +91 8800836581 | T: 0124-433 | EXT: 20910 ASF Centre A | 1st Floor | Udyog Vihar Phase IV | Gurgaon | Haryana | India *Disclaimer:* This communication is for the sole

Re: Passing Custom App Id for consumption in History Server

2016-09-03 Thread Raghavendra Pandey
The default implementation is to add milliseconds. For Mesos it is the framework-id. If you are using Mesos, you can assume that the framework id used to register your app is the same as the app-id. As you said, you have a system application to schedule Spark jobs, so you can keep track of the framework-ids submitted

Re: Importing large file with SparkContext.textFile

2016-09-03 Thread Somasundaram Sekar
If the file is not splittable (can I assume the log file is splittable, though?), can you advise on how Spark handles such a case? If Spark can't, what is the widely used practice? On 3 Sep 2016 7:29 pm, "Raghavendra Pandey" wrote: If your file format is splittable say

Re: Pausing spark kafka streaming (direct) or exclude/include some partitions on the fly per batch

2016-09-03 Thread Cody Koeninger
Not built in, you're going to have to do some work. On Sep 2, 2016 16:33, "sagarcasual ." wrote: > Hi Cody, thanks for the reply. > I am using Spark 1.6.1 with Kafka 0.9. > When I want to stop streaming, stopping the context sounds ok, but for > temporarily excluding

Re: how to pass trustStore path into pyspark ?

2016-09-03 Thread Raghavendra Pandey
Did you try passing them in spark-env.sh? On Sat, Sep 3, 2016 at 2:42 AM, Eric Ho wrote: > I'm trying to pass a trustStore pathname into pyspark. > What env variable and/or config file or script do I need to change to do this? > I've tried setting the JAVA_OPTS env var but to

Re: Importing large file with SparkContext.textFile

2016-09-03 Thread Raghavendra Pandey
If your file format is splittable, say TSV, CSV, etc., it will be distributed across all executors. On Sat, Sep 3, 2016 at 3:38 PM, Somasundaram Sekar < somasundar.se...@tigeranalytics.com> wrote: > Hi All, > > > > Would like to gain some understanding on the questions listed below, > > > > 1.
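A quick Scala check of how many partitions such a file is read into (the path is a placeholder); each input split becomes one partition, and partitions are spread across the executors:

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("split-check").setMaster("local[*]"))
    val lines = sc.textFile("hdfs:///data/somefile.csv")   // splittable format => roughly one partition per HDFS block/split
    println(lines.partitions.length)                       // number of partitions the file was split into
    sc.stop()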

Re: Catalog, SessionCatalog and ExternalCatalog in spark 2.0

2016-09-03 Thread Raghavendra Pandey
Kapil -- I'm afraid you need to plug in your own SessionCatalog, as the ResolveRelations class depends on that. To keep the design consistent you may want to implement ExternalCatalog as well. You can also look at plugging in your own Analyzer class to give you more flexibility. Ultimately that is where

Pls assist: Spark 2.0 build failure on Ubuntu 16.06

2016-09-03 Thread Marco Mistroni
Hi all, I am getting failures when building Spark 2.0 on Ubuntu 16.06. Here are the details of what I have installed on the Ubuntu host: - Java 8 - Scala 2.11 - Git. When I launch the command ./build/mvn -Pyarn -Phadoop-2.7 -DskipTests clean package, everything compiles sort of fine and at the end I

Re[7]: Spark 2.0: SQL runs 5x times slower when adding 29th field to aggregation.

2016-09-03 Thread Сергей Романов
And an even simpler case: >>> df = sc.parallelize([1] for x in xrange(760857)).toDF() >>> for x in range(50, 70): print x, timeit.timeit(df.groupBy().sum(*(['_1'] * x)).collect, number=1) 50 1.91226291656 51 1.50933384895 52 1.582903862 53 1.90537405014 54 1.84442877769 55 1.9177978 56

Re[6]: Spark 2.0: SQL runs 5x times slower when adding 29th field to aggregation.

2016-09-03 Thread Сергей Романов
Same problem happens with CSV data file, so it's not parquet-related either. [Spark welcome banner, version 2.0.0] Using Python version 2.7.6 (default, Jun 22 2015 17:58:13)

Re[5]: Spark 2.0: SQL runs 5x times slower when adding 29th field to aggregation.

2016-09-03 Thread Сергей Романов
Hi, I have narrowed my problem down to a very simple case. I'm sending a 27 KB parquet file as an attachment (file:///data/dump/test2 in the example). Please, can you take a look at it? Why is there a performance drop after 57 sum columns? [Spark welcome banner]

Catalog, SessionCatalog and ExternalCatalog in spark 2.0

2016-09-03 Thread Kapil Malik
Hi all, I have a Spark SQL 1.6 application in production which does the following on executing sqlContext.sql(...): 1. Identify the table name mentioned in the query. 2. Use an external database to decide where the data is located and in which format (parquet, csv or jdbc), etc. 3. Load the dataframe. 4.

Re[4]: Spark 2.0: SQL runs 5x times slower when adding 29th field to aggregation.

2016-09-03 Thread Сергей Романов
Hi Mich, I don't think it is related to Hive or parquet partitioning. The same issue happens while working with a non-partitioned parquet file using the Python DataFrame API. Please take a look at the following example: $ hdfs dfs -ls /user/test   // I had copied partition dt=2016-07-28 to another

Re: any idea what this error could be?

2016-09-03 Thread Fridtjof Sander
There is an InvalidClassException complaining about non-matching serialVersionUIDs. Shouldn't that be caused by different jars on the executors and the driver? On 03.09.2016 at 1:04 PM, "Tal Grynbaum" wrote: > My guess is that you're running out of memory somewhere. Try to

Re: any idea what this error could be?

2016-09-03 Thread Tal Grynbaum
My guess is that you're running out of memory somewhere. Try to increase the driver memory and/or executor memory. On Sat, Sep 3, 2016, 11:42 kant kodali wrote: > I am running this on aws. > > > > On Fri, Sep 2, 2016 11:49 PM, kant kodali kanth...@gmail.com wrote: > >> I am

Importing large file with SparkContext.textFile

2016-09-03 Thread Somasundaram Sekar
Hi all, I would like to gain some understanding of the questions listed below. 1. When processing a large file with Apache Spark, with, say, sc.textFile("somefile.xml"), does it split it for parallel processing across executors, or will it be processed as a single chunk in a single

Need help with row repetition

2016-09-03 Thread Selvam Raman
I have my dataset as a DataFrame, using Spark version 1.5.0. cola, colb, colc, cold, cole, colf, colg, colh, coli -> columns in a row. Among the columns above, the date fields are (colc, colf, colh, coli). Scenario: (colc - 2016, colf - 2016, colh - 2016, coli - 2016) if all the years are the same, no logic is needed.
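A minimal Spark 1.5 DataFrame sketch of the "are all four years equal" check described above; df stands for the DataFrame with those columns, and the actual row-repetition logic for mismatched years is not shown because the requirement is truncated here:

    import org.apache.spark.sql.functions.{col, year}

    // df: DataFrame with date columns colc, colf, colh, coli
    val withFlag = df.withColumn(
      "all_years_same",
      year(col("colc")) === year(col("colf")) &&
        year(col("colf")) === year(col("colh")) &&
        year(col("colh")) === year(col("coli")))
    withFlag.show()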

Hive connection issues in spark-shell

2016-09-03 Thread Diwakar Dhanuskodi
Hi, I recently built Spark using Maven. Now, when starting spark-shell, it can't connect to Hive and I'm getting the error below. I couldn't find the datanucleus jar in the built libraries, but the datanucleus jar is available in the hive/lib folder. java.lang.ClassNotFoundException:

Re: Spark build 1.6.2 error

2016-09-03 Thread Diwakar Dhanuskodi
Sorry, my bad. In both runs I included -Dscala-2.11. On Sat, Sep 3, 2016 at 12:39 PM, Nachiketa wrote: > I think the difference was the -Dscala-2.11 on the command line. > > I have seen this show up when I miss that. > > Regards, > Nachiketa > > On Sat 3 Sep, 2016,

Re: Spark build 1.6.2 error

2016-09-03 Thread Nachiketa
I think the difference was the -Dscala-2.11 on the command line. I have seen this show up when I miss that. Regards, Nachiketa On Sat 3 Sep, 2016, 12:14 PM Diwakar Dhanuskodi, < diwakar.dhanusk...@gmail.com> wrote: > Hi, > > Just re-ran again without killing zinc server process > >

Re: Spark build 1.6.2 error

2016-09-03 Thread Diwakar Dhanuskodi
Hi, I just re-ran again without killing the zinc server process: ./make-distribution.sh --name custom-spark --tgz -Phadoop-2.6 -Phive -Pyarn -Dmaven.version=3.0.4 -Dscala-2.11 -X -rf :spark-sql_2.11. The build is a success. Not sure how it worked by just re-running the command again. On Sat, Sep 3, 2016 at

Re: Spark build 1.6.2 error

2016-09-03 Thread Diwakar Dhanuskodi
Hi, Java version: 7. mvn command: ./make-distribution.sh --name custom-spark --tgz -Phadoop-2.6 -Phive -Phive-thriftserver -Pyarn -Dmaven.version=3.0.4. Yes, I executed the script to change the Scala version to 2.11, killed the "com.typesafe zinc.Nailgun" process, and re-ran mvn with the command below again