Re: migration from Teradata to Spark SQL

2016-05-03 Thread Deepak Sharma
Hi Tapan, I would suggest an architecture where you have a separate storage layer and data serving layer. Spark is still best for batch processing of data. So what I am suggesting here is you can have your data stored as-is in some HDFS raw layer, run your ELT in Spark on this raw data and
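
A minimal sketch of that raw-to-curated flow, assuming the spark-csv package is on the classpath; the HDFS paths and column name are purely hypothetical:

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)

    // Read the as-is raw layer (hypothetical path and format).
    val raw = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .load("hdfs:///data/raw/transactions/")

    // One illustrative ELT step: drop rows missing an amount.
    val curated = raw.filter(raw("amount").isNotNull)

    // Persist the curated layer as Parquet for the serving / ad-hoc query layer.
    curated.write.parquet("hdfs:///data/curated/transactions/")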

Re: spark 1.6.1 build failure of : scala-maven-plugin

2016-05-03 Thread Divya Gehlot
Hi, Even I am getting a similar error: Failed to execute goal net.alchim31.maven:scala-maven-plugin:3.2.2:compile when I tried to build the Phoenix project using Maven. Maven version: 3.3 Java version: 1.7_67 Phoenix: downloaded latest master from GitHub. If anybody finds the resolution

Re: parquet table in spark-sql

2016-05-03 Thread 喜之郎
Thanks for your answer, Sandeep. And also thanks, Varadharajan. -- Original message -- From: "Sandeep Nemuri"; Sent: Tuesday, May 3, 2016, 8:48 PM To: "Varadharajan Mukundan"; Cc: "喜之郎"<251922...@qq.com>; "user";

Re: spark 1.6.1 build failure of : scala-maven-plugin

2016-05-03 Thread Ted Yu
Which version are you using now? I wonder if 1.8.0_91 had a problem. Cheers On Tue, May 3, 2016 at 6:29 PM, sunday2000 <2314476...@qq.com> wrote: > Problem solved by using a newer version of javac: > [INFO] > > [INFO] BUILD

Re: spark 1.6.1 build failure of : scala-maven-plugin

2016-05-03 Thread sunday2000
Problem solved by using a newer version of javac: [INFO] [INFO] BUILD SUCCESS [INFO] [INFO] Total time: 02:07 h [INFO] Finished at:

migration from Teradata to Spark SQL

2016-05-03 Thread Tapan Upadhyay
Hi, We are planning to move our ad-hoc queries from Teradata to Spark. We have a huge volume of queries during the day. What is the best way to go about it - 1) Read data directly from the Teradata DB using Spark JDBC 2) Import data using Sqoop via EOD jobs into Hive tables stored as Parquet and then run
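
For option 1, a rough sketch of a JDBC read; the host, table and credentials are placeholders, and the driver class assumes the Teradata JDBC jar is on the classpath:

    // Hypothetical Teradata source pulled straight into a DataFrame.
    val tdDF = sqlContext.read.format("jdbc")
      .option("url", "jdbc:teradata://td-host/DATABASE=sales")
      .option("driver", "com.teradata.jdbc.TeraDriver")
      .option("dbtable", "daily_transactions")
      .option("user", "etl_user")
      .option("password", "changeme")
      .load()

    // Expose it to ad-hoc Spark SQL queries.
    tdDF.registerTempTable("daily_transactions")
    sqlContext.sql("SELECT COUNT(*) FROM daily_transactions").show()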

Re: Bit(N) on create Table with MSSQLServer

2016-05-03 Thread Mich Talebzadeh
Can you create the MSSQL (target) table first with the correct column settings and insert data from Spark into it with JDBC, as opposed to having JDBC create the target table itself? Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
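
A sketch of that approach, assuming the target table dbo.results already exists in SQL Server with the desired BIT column, and that resultDF is the DataFrame to be written; names and credentials are placeholders:

    import java.util.Properties

    val props = new Properties()
    props.setProperty("user", "spark_etl")      // hypothetical credentials
    props.setProperty("password", "changeme")

    resultDF.write
      .mode("append")                           // append into the pre-created table
      .jdbc("jdbc:sqlserver://mssql-host:1433;databaseName=reporting",
            "dbo.results", props)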

RE: Multiple Spark Applications that use Cassandra, how to share resources/nodes

2016-05-03 Thread Mohammed Guller
You can run multiple Spark applications simultaneously. Just limit the # of cores and memory allocated to each application. For example, if each node has 8 cores and there are 10 nodes and you want to be able to run 4 applications simultaneously, limit the # of cores for each application to 20.
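
A sketch of that cap for one of the four applications on a standalone cluster; the app name and memory figure are illustrative:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("app-1-of-4")
      .set("spark.cores.max", "20")        // 80 total cores / 4 concurrent apps
      .set("spark.executor.memory", "4g")  // sized to leave room for the other apps
    val sc = new SparkContext(conf)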

Re: Alternative to groupByKey() + mapValues() for non-commutative, non-associative aggregate?

2016-05-03 Thread Kevin Mellott
If you put this into a DataFrame then you may be able to use one-hot encoding and treat these as categorical features. I believe that the ML pipeline components use Project Tungsten, so the performance will be very fast. After you process the result on the DataFrame you would then need to assemble
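
A minimal sketch of that idea, assuming a DataFrame df with a categorical column named procedure_code (both names are hypothetical):

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer, VectorAssembler}

    // Index the string category, one-hot encode it, then assemble the feature vector.
    val indexer = new StringIndexer()
      .setInputCol("procedure_code").setOutputCol("procedure_index")
    val encoder = new OneHotEncoder()
      .setInputCol("procedure_index").setOutputCol("procedure_vec")
    val assembler = new VectorAssembler()
      .setInputCols(Array("procedure_vec")).setOutputCol("features")

    val encoded = new Pipeline()
      .setStages(Array(indexer, encoder, assembler))
      .fit(df)
      .transform(df)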

Alternative to groupByKey() + mapValues() for non-commutative, non-associative aggregate?

2016-05-03 Thread Bibudh Lahiri
Hi, I have multiple procedure codes that a patient has undergone, in an RDD with a different row for each combination of patient and procedure. I am trying to convert this data to the LibSVM format, so that the result looks as follows: "0 1:1 2:0 3:1 29:1 30:1 32:1 110:1" where 1, 2, 3,
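
One possible alternative to groupByKey here, sketched under the assumption that the input is an RDD[(Int, Int)] of (patientId, procedureCode) pairs with 1-based codes, and that every patient gets the label 0:

    // aggregateByKey builds the per-patient code set without shipping full groups around.
    val codesPerPatient = patientProcedures.aggregateByKey(Set.empty[Int])(
      (set, code) => set + code,
      (a, b) => a ++ b)

    // Emit one LibSVM line per patient: "<label> <index>:1 <index>:1 ...".
    val libsvmLines = codesPerPatient.map { case (_, codes) =>
      "0 " + codes.toSeq.sorted.map(c => s"$c:1").mkString(" ")
    }
    libsvmLines.saveAsTextFile("hdfs:///data/procedures_libsvm")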

Re: Improving performance of a kafka spark streaming app

2016-05-03 Thread Colin Kincaid Williams
Thanks Cody, I can see that the partitions are well distributed... I'm now in the process of moving to the direct API. On Tue, May 3, 2016 at 6:51 PM, Cody Koeninger wrote: > 60 partitions in and of itself shouldn't be a big performance issue > (as long as producers are

Calculating log-loss for the trained model in Spark ML

2016-05-03 Thread Abhishek Anand
I am building a ML pipeline for logistic regression. val lr = new LogisticRegression() lr.setMaxIter(100).setRegParam(0.001) val pipeline = new Pipeline().setStages(Array(geoDimEncoder,clientTypeEncoder, devTypeDimIdEncoder,pubClientIdEncoder,tmpltIdEncoder,
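
One way to compute log-loss by hand once the pipeline is fitted, sketched assuming the model writes the usual "label" and "probability" columns, and that pipelineModel and testDF are the fitted model and a held-out DataFrame:

    import org.apache.spark.mllib.linalg.Vector

    val scored = pipelineModel.transform(testDF).select("label", "probability")

    val eps = 1e-15
    val logLoss = scored.map { row =>
      val y = row.getDouble(0)
      val p = math.max(eps, math.min(1 - eps, row.getAs[Vector](1)(1)))  // P(label = 1)
      -(y * math.log(p) + (1 - y) * math.log(1 - p))
    }.mean()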

Re: Error while running jar using spark-submit on another machine

2016-05-03 Thread nsalian
Thank you for the question. What is different on this machine as compared to the ones where the job succeeded? - Neelesh S. Salian Cloudera -- View this message in context:

Re: a question about --executor-cores

2016-05-03 Thread nsalian
Hello, Thank you for posting the question. To begin, I do have a few questions. 1) What is the size of the YARN installation? How many NodeManagers? 2) Notes to remember: Container Virtual CPU Cores yarn.nodemanager.resource.cpu-vcores >> Number of virtual CPU cores that can be allocated for

Re: Creating new Spark context when running in Secure YARN fails

2016-05-03 Thread nsalian
Feel free to correct me if I am wrong, but I believe this isn't a feature yet: "create a new Spark context within a single JVM process (driver)". A few questions for you: 1) Is Kerberos set up correctly for you (the user)? 2) Could you please add the command/code you are executing? Checking to

Free memory while launching jobs.

2016-05-03 Thread mjordan79
I have a machine with 8GB of total memory, on which there are other applications installed. The Spark application must run 1 driver and two jobs at a time. I have configured 8 cores in total. The machine (without Spark) has 4GB of free RAM (the other half of the RAM is used by other applications). So I

Re: Weird results with Spark SQL Outer joins

2016-05-03 Thread Kevin Peng
Mike, It looks like you are right. The results seem to be fine. It looks like I messed up on the filtering clause. sqlContext.sql("SELECT s.date AS edate , s.account AS s_acc , d.account AS d_acc , s.ad as s_ad , d.ad as d_ad , s.spend AS s_spend , d.spend_in_dollar AS d_spend FROM

Re: Improving performance of a kafka spark streaming app

2016-05-03 Thread Cody Koeninger
60 partitions in and of itself shouldn't be a big performance issue (as long as producers are distributing across partitions evenly). On Tue, May 3, 2016 at 1:44 PM, Colin Kincaid Williams wrote: > Thanks again Cody. Regarding the details 66 kafka partitions on 3 > kafka servers,

Re: Improving performance of a kafka spark streaming app

2016-05-03 Thread Colin Kincaid Williams
Thanks again Cody. Regarding the details: 66 Kafka partitions on 3 Kafka servers, likely 8-core systems with 10 disks each. Maybe the issue with the receiver was the large number of partitions. I had miscounted the disks, and so 11*3*2 is how I decided to partition my topic on insertion (by my

Re: removing header from csv file

2016-05-03 Thread Michael Segel
Hi, Another silly question… Don’t you want to use the header line to help create a schema for the RDD? Thx -Mike > On May 3, 2016, at 8:09 AM, Mathieu Longtin wrote: > > This only works if the files are "unsplittable". For example gzip files, each > partition is
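
For reference, a small sketch of letting spark-csv consume the header (and optionally infer types) rather than dropping it manually; the path is hypothetical:

    val df = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")        // first line supplies the column names
      .option("inferSchema", "true")   // sample the data to guess column types
      .load("hdfs:///data/input/*.csv")
    df.printSchema()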

Re: Weird results with Spark SQL Outer joins

2016-05-03 Thread Michael Segel
Silly question? If you change the predicate to ( s.date >= ‘2016-01-03’ OR s.date IS NULL ) AND (d.date >= ‘2016-01-03’ OR d.date IS NULL) What do you get? Sorry if the syntax isn’t 100% correct. The idea is to not drop null values from the query. I would imagine that this shouldn’t
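
A sketch of that null-preserving filter applied to the query quoted elsewhere in the thread, assuming the two DataFrames have been registered as temp tables swig and dps; the ON conditions are illustrative:

    val result = sqlContext.sql("""
      SELECT s.date AS edate, s.account AS s_acc, d.account AS d_acc,
             s.spend AS s_spend, d.spend_in_dollar AS d_spend
      FROM swig s
      FULL OUTER JOIN dps d
        ON s.account = d.account AND s.ad = d.ad AND s.date = d.date
      WHERE (s.date >= '2016-01-03' OR s.date IS NULL)
        AND (d.date >= '2016-01-03' OR d.date IS NULL)
    """)
    result.show()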

Spark 1.5.2 Shuffle Blocks - running out of memory

2016-05-03 Thread Nirav Patel
Hi, My Spark application is getting killed abruptly during a groupBy operation where a shuffle happens. All shuffles happen with PROCESS_LOCAL locality. I see the following in the driver logs. Shouldn't these logs be in the executors? Anyhow, it looks like ByteBuffer is running out of memory. What would be a workaround

Re: Weird results with Spark SQL Outer joins

2016-05-03 Thread Kevin Peng
Davies, What exactly do you mean in regards to Spark 2.0 turning these joins into inner joins? Does this mean that Spark SQL won't support where clauses in outer joins? Cesar & Gourav, When running the queries without the where clause it works as expected. I am pasting my results

Re: yarn-cluster

2016-05-03 Thread nsalian
Hello, Thank you for the question. The status UNDEFINED means the application has not completed and has not yet been assigned resources. Upon getting an assignment it will progress to RUNNING and then SUCCEEDED upon completion. It isn't a problem that you should worry about. You should make sure to tune your

Re: Weird results with Spark SQL Outer joins

2016-05-03 Thread Davies Liu
Bingo, the two predicates s.date >= '2016-01-03' AND d.date >= '2016-01-03' are the root cause: they filter out all the nulls from the outer join, giving the same result as an inner join. In Spark 2.0, we actually turn these joins into inner joins. On Tue, May 3, 2016 at 9:50 AM, Cesar Flores

Re: how to orderBy previous groupBy.count.orderBy in pyspark

2016-05-03 Thread webe3vt
Here is what I ended up doing. Improvements are welcome. from pyspark.sql import SQLContext, Row from pyspark.sql.types import StructType, StructField, IntegerType, StringType from pyspark.sql.functions import asc, desc, sum, count sqlContext = SQLContext(sc) error_schema = StructType([

Re: Weird results with Spark SQL Outer joins

2016-05-03 Thread Cesar Flores
Hi, Have you tried the joins without the where clause? When you use it you are filtering out all the rows with null columns in those fields. In other words you are doing an inner join in all your queries. On Tue, May 3, 2016 at 11:37 AM, Gourav Sengupta wrote: > Hi

Re: Weird results with Spark SQL Outer joins

2016-05-03 Thread Gourav Sengupta
Hi Kevin, Having given it a first look, I do think that you have hit something here and this does not look quite right. I have to work on the multiple AND conditions in ON and see whether that is causing any issues. Regards, Gourav Sengupta On Tue, May 3, 2016 at 8:28 AM, Kevin Peng

Re: Error from reading S3 in Scala

2016-05-03 Thread Gourav Sengupta
Hi, The best thing to do is start the EMR clusters with proper permissions in the roles; that way you do not need to worry about the keys at all. Another thing: why are we using s3a:// instead of s3:// ? Besides that you can increase S3 speeds using the instructions mentioned here:

--jars for mesos cluster

2016-05-03 Thread Alex Dzhagriev
Hello all, In the Mesos related spark docs ( http://spark.apache.org/docs/1.6.0/running-on-mesos.html#cluster-mode) I found this statement: Note that jars or python files that are passed to spark-submit should be > URIs reachable by Mesos slaves, as the Spark driver doesn’t automatically >

Re: Multiple Spark Applications that use Cassandra, how to share resources/nodes

2016-05-03 Thread Andy Davidson
Hi Tobias, I am very interested in implementing a REST-based API on top of Spark. My REST-based system would make predictions from data provided in the request using models trained in batch. My SLA is 250 ms. Would you mind sharing how you implemented your REST server? I am using spark-1.6.1. I have

Multiple Spark Applications that use Cassandra, how to share resources/nodes

2016-05-03 Thread Tobias Eriksson
Hi, We are using Spark for a long-running job; in fact it is a REST server that does some joins with some tables in Cassandra and returns the result. Now we need to have multiple applications running in the same Spark cluster, and from what I understand this is not possible, or should I say

Re: Spark streaming app starts processing when kill that app

2016-05-03 Thread Shams ul Haque
Hey Hareesh, Thanks for the help; they were starving. I increased the cores and memory on that machine. Now it is working fine. Thanks again On Tue, May 3, 2016 at 12:57 PM, Shams ul Haque wrote: > No, i made a cluster of 2 machines. And after submission to master, this >

unsubscribe

2016-05-03 Thread Rodrick Brown
unsubscribe \-- **Rodrick Brown** / Systems Engineer +1 917 445 6839 / [rodr...@orchardplatform.com](mailto:char...@orchardplatform.com) **Orchard Platform** 101 5th Avenue, 4th Floor, New York, NY 10003 [http://www.orchardplatform.com](http://www.orchardplatform.com/) [Orchard

Re: removing header from csv file

2016-05-03 Thread Mathieu Longtin
This only works if the files are "unsplittable". For example with gzip files, each partition is one file (if you have more partitions than files), so the first line of each partition is the header. The spark-csv extension reads the very first line of the RDD, assumes it's the header, and then filters

Re: parquet table in spark-sql

2016-05-03 Thread Sandeep Nemuri
We don't need any delimiters for the Parquet file format. On Tue, May 3, 2016 at 5:31 AM, Varadharajan Mukundan wrote: > Hi, > > Yes, it is not needed. Delimiters are needed only for text files. > > On Tue, May 3, 2016 at 12:49 PM, 喜之郎 <251922...@qq.com> wrote: > >> hi, I
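
A minimal DDL sketch (run through a HiveContext): no ROW FORMAT or delimiters, since Parquet carries its own schema. The table name and first column are hypothetical; total_duration comes from the original question:

    sqlContext.sql("""
      CREATE TABLE IF NOT EXISTS call_stats (
        caller STRING,
        total_duration BIGINT
      )
      STORED AS PARQUET
    """)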

[Spark 1.5.2] Spark dataframes vs sql query -performance parameter ?

2016-05-03 Thread Divya Gehlot
Hi, I am interested to know on which parameters we can say Spark DataFrames are better than SQL queries. I would be grateful if somebody could explain it to me with use cases. Thanks, Divya

Re: parquet table in spark-sql

2016-05-03 Thread Varadharajan Mukundan
Hi, Yes, it is not needed. Delimiters are needed only for text files. On Tue, May 3, 2016 at 12:49 PM, 喜之郎 <251922...@qq.com> wrote: > hi, I want to ask a question about parquet tables in spark-sql. > > I think that Parquet has schema information in its own file, > so you don't need to define

Re: Error from reading S3 in Scala

2016-05-03 Thread Steve Loughran
Don't put your secret in the URI, it'll only creep out in the logs. Use the specific properties covered in http://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html, which you can set in your Spark context by prefixing them with spark.hadoop. You can also set the env vars,
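
A small sketch of that advice, pulling placeholder credentials from environment variables and handing them to Hadoop through the spark.hadoop. prefix; the bucket and path are hypothetical:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("s3a-read")
      .set("spark.hadoop.fs.s3a.access.key", sys.env("AWS_ACCESS_KEY_ID"))
      .set("spark.hadoop.fs.s3a.secret.key", sys.env("AWS_SECRET_ACCESS_KEY"))
    val sc = new SparkContext(conf)

    // The URI itself now carries no credentials.
    val lines = sc.textFile("s3a://my-bucket/path/to/data/")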

Re: Reading from Amazon S3

2016-05-03 Thread Steve Loughran
On 2 May 2016, at 19:24, Gourav Sengupta wrote: Jorn, what aspects are you speaking about? My response was absolutely pertinent to Jinan because he would not even face the problem if he used Scala. So it was along the lines of

Re: Reading from Amazon S3

2016-05-03 Thread Steve Loughran
I'm going to start by letting you know two secret tools we use for diagnosing faults, one for big data at work, the other for a large RDBMS behind a web UI: 1. Google 2. The search field in Apache JIRA. Given this is a senior project, these foundational tools are something you are going to need to

parquet table in spark-sql

2016-05-03 Thread 喜之郎
hi, I want to ask a question about parquet tables in spark-sql. I think that Parquet has schema information in its own file, so you don't need to define a row separator and column separator in the CREATE TABLE DDL, like this: total_duration BigInt) ROW FORMAT DELIMITED FIELDS TERMINATED BY

Re: kafka direct streaming python API fromOffsets

2016-05-03 Thread Saisai Shao
I guess the problem is that py4j automatically translates the Python int into a Java int or long according to the value of the data. If the value is small it will translate it to a Java int, otherwise it will translate it into a Java long. But in the Java code, the parameter must be of long type, so that's the

Re: kafka direct streaming python API fromOffsets

2016-05-03 Thread Tigran Avanesov
Thank you, But now I have this error: java.lang.ClassCastException: java.lang.Integer cannot be cast to java.lang.Long My offsets are actually not big enough to be longs. If I put bigger values, I have no such exception. To me it looks like a bug. Any ideas for a workaround? Thanks! On

Re: Spark build failure with com.oracle:ojdbc6:jar:11.2.0.1.0

2016-05-03 Thread Mich Talebzadeh
Which version of Spark are you using? Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw * http://talebzadehmich.wordpress.com On 3 May 2016 at

Re: Weird results with Spark SQL Outer joins

2016-05-03 Thread Kevin Peng
Davies, Here is the code that I am typing into the spark-shell along with the results (my question is at the bottom): val dps = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").load("file:///home/ltu/dps_csv/") val swig =

Re: Spark streaming app starts processing when kill that app

2016-05-03 Thread Shams ul Haque
No, I made a cluster of 2 machines. And after submission to the master, this app moves to the slave machine for execution. Well, I am going to give your suggestion a try by running both on the same machine. Thanks Shams On Tue, May 3, 2016 at 12:53 PM, hareesh makam wrote: > If

Re: removing header from csv file

2016-05-03 Thread Abhishek Anand
You can use this function to remove the header from your dataset (applicable to RDDs): def dropHeader(data: RDD[String]): RDD[String] = { data.mapPartitionsWithIndex((idx, lines) => { if (idx == 0) lines.drop(1) else lines }) } Abhi On Wed, Apr 27, 2016 at
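
The same helper as a self-contained sketch, with the else branch spelled out so the advanced iterator is what gets returned for partition 0:

    import org.apache.spark.rdd.RDD

    def dropHeader(data: RDD[String]): RDD[String] =
      data.mapPartitionsWithIndex { (idx, lines) =>
        if (idx == 0) lines.drop(1) else lines  // only the first partition holds the header
      }

    // Usage on a hypothetical file:
    // val noHeader = dropHeader(sc.textFile("hdfs:///data/input.csv"))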

Re: Spark streaming app starts processing when kill that app

2016-05-03 Thread hareesh makam
If you are running your master on a single core, it might be an issue of starvation. Assuming you are running it locally, try setting the master to local[2] or higher. Check the first example at https://spark.apache.org/docs/latest/streaming-programming-guide.html - Hareesh On 3 May 2016 at 12:35,
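
For reference, the shape of that first example from the streaming guide, using at least two local cores so the receiver and the processing tasks don't starve each other:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
    val ssc = new StreamingContext(conf, Seconds(1))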

Clear Threshold in Logistic Regression ML Pipeline

2016-05-03 Thread Abhishek Anand
Hi All, I am trying to build a logistic regression pipeline in ML. How can I clear the threshold, which by default is 0.5? In MLlib I am able to clear the threshold to get the raw predictions using the model.clearThreshold() function. Regards, Abhi
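
One way to get at the un-thresholded scores in ML, sketched assuming a fitted pipelineModel and a test DataFrame: the logistic regression stage writes "rawPrediction" and "probability" columns alongside the thresholded "prediction", so you can read those directly:

    val scored = pipelineModel.transform(testDF)
      .select("features", "rawPrediction", "probability", "prediction")
    scored.show(5)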

Spark streaming app starts processing when kill that app

2016-05-03 Thread Shams ul Haque
Hi all, I am facing a strange issue when running a Spark Streaming app. When I submit my app by *spark-submit* it works fine and is also visible in the Spark UI, but it doesn't process any data coming from Kafka. And when I kill that app by pressing Ctrl + C on the terminal, then it starts

Submit job to spark cluster Error ErrorMonitor dropping message...

2016-05-03 Thread Tenghuan He
Hi, I deployed a Spark cluster with a master and a worker. The master and worker are each on a VMware virtual machine with 1G memory and 2 cores. Master IP: 192.168.179.133, worker IP: 192.168.179.134. After executing sbin/start-all.sh, the master and the worker start up; visit

Re: Weird results with Spark SQL Outer joins

2016-05-03 Thread Davies Liu
As @Gourav said, all the joins with different join types show the same results, which means that every row from the left could match at least one row from the right, and every row from the right could match at least one row from the left, even though the number of rows on the left does not equal that on the right. This is