Re: migration from Teradata to Spark SQL

2016-05-03 Thread Deepak Sharma
Hi Tapan, I would suggest an architecture where you have a separate storage layer and data serving layer. Spark is still best for batch processing of data. So what I am suggesting here is you can have your data stored as-is in some HDFS raw layer, run your ELT in Spark on this raw data and
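
A minimal sketch of that raw-to-curated flow, assuming the spark-csv package is on the classpath; the HDFS paths and column name are purely hypothetical:

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)

    // Read the as-is raw layer (hypothetical path and format).
    val raw = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .load("hdfs:///data/raw/transactions/")

    // One illustrative ELT step: drop rows missing an amount.
    val curated = raw.filter(raw("amount").isNotNull)

    // Persist the curated layer as Parquet for the serving / ad-hoc query layer.
    curated.write.parquet("hdfs:///data/curated/transactions/")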

Re: spark 1.6.1 build failure of : scala-maven-plugin

2016-05-03 Thread Divya Gehlot
Hi, Even I am getting a similar error: Failed to execute goal net.alchim31.maven:scala-maven-plugin:3.2.2:compile when I tried to build the Phoenix project using Maven. Maven version: 3.3 Java version: 1.7_67 Phoenix: downloaded latest master from GitHub. If anybody finds the resolution

Re: parquet table in spark-sql

2016-05-03 Thread 喜之郎
Thanks for your answer, Sandeep. And also thanks, Varadharajan. -- Original message -- From: "Sandeep Nemuri"; Sent: Tuesday, May 3, 2016, 8:48 PM To: "Varadharajan Mukundan"; Cc: "喜之郎"<251922...@qq.com>; "user";

Re: spark 1.6.1 build failure of : scala-maven-plugin

2016-05-03 Thread Ted Yu
Which version are you using now? I wonder if 1.8.0_91 had a problem. Cheers On Tue, May 3, 2016 at 6:29 PM, sunday2000 <2314476...@qq.com> wrote: > Problem solved by using a newer version of javac: > [INFO] > > [INFO] BUILD

Re: spark 1.6.1 build failure of : scala-maven-plugin

2016-05-03 Thread sunday2000
Problem solved by using a newer version of javac: [INFO] [INFO] BUILD SUCCESS [INFO] [INFO] Total time: 02:07 h [INFO] Finished at:

migration from Teradata to Spark SQL

2016-05-03 Thread Tapan Upadhyay
Hi, We are planning to move our ad-hoc queries from Teradata to Spark. We have a huge volume of queries during the day. What is the best way to go about it - 1) Read data directly from the Teradata DB using Spark JDBC 2) Import data using Sqoop via EOD jobs into Hive tables stored as Parquet and then run
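
For option 1, a rough sketch of a JDBC read; the host, table and credentials are placeholders, and the driver class assumes the Teradata JDBC jar is on the classpath:

    // Hypothetical Teradata source pulled straight into a DataFrame.
    val tdDF = sqlContext.read.format("jdbc")
      .option("url", "jdbc:teradata://td-host/DATABASE=sales")
      .option("driver", "com.teradata.jdbc.TeraDriver")
      .option("dbtable", "daily_transactions")
      .option("user", "etl_user")
      .option("password", "changeme")
      .load()

    // Expose it to ad-hoc Spark SQL queries.
    tdDF.registerTempTable("daily_transactions")
    sqlContext.sql("SELECT COUNT(*) FROM daily_transactions").show()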

Re: Bit(N) on create Table with MSSQLServer

2016-05-03 Thread Mich Talebzadeh
Can you create the MSSQL (target) table first with the correct column settings and insert data from Spark into it with JDBC, as opposed to having JDBC create the target table itself? Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
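
A sketch of that approach, assuming the target table dbo.results already exists in SQL Server with the desired BIT column, and that resultDF is the DataFrame to be written; names and credentials are placeholders:

    import java.util.Properties

    val props = new Properties()
    props.setProperty("user", "spark_etl")      // hypothetical credentials
    props.setProperty("password", "changeme")

    resultDF.write
      .mode("append")                           // append into the pre-created table
      .jdbc("jdbc:sqlserver://mssql-host:1433;databaseName=reporting",
            "dbo.results", props)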

RE: Multiple Spark Applications that use Cassandra, how to share resources/nodes

2016-05-03 Thread Mohammed Guller
You can run multiple Spark applications simultaneously. Just limit the # of cores and memory allocated to each application. For example, if each node has 8 cores and there are 10 nodes and you want to be able to run 4 applications simultaneously, limit the # of cores for each application to 20.
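
A sketch of that cap for one of the four applications on a standalone cluster; the app name and memory figure are illustrative:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("app-1-of-4")
      .set("spark.cores.max", "20")        // 80 total cores / 4 concurrent apps
      .set("spark.executor.memory", "4g")  // sized to leave room for the other apps
    val sc = new SparkContext(conf)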

Re: Alternative to groupByKey() + mapValues() for non-commutative, non-associative aggregate?

2016-05-03 Thread Kevin Mellott
If you put this into a DataFrame then you may be able to use one-hot encoding and treat these as categorical features. I believe that the ML pipeline components use Project Tungsten, so the performance will be very fast. After you process the result on the DataFrame you would then need to assemble
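
A minimal sketch of that idea, assuming a DataFrame df with a categorical column named procedure_code (both names are hypothetical):

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer, VectorAssembler}

    // Index the string category, one-hot encode it, then assemble the feature vector.
    val indexer = new StringIndexer()
      .setInputCol("procedure_code").setOutputCol("procedure_index")
    val encoder = new OneHotEncoder()
      .setInputCol("procedure_index").setOutputCol("procedure_vec")
    val assembler = new VectorAssembler()
      .setInputCols(Array("procedure_vec")).setOutputCol("features")

    val encoded = new Pipeline()
      .setStages(Array(indexer, encoder, assembler))
      .fit(df)
      .transform(df)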

Alternative to groupByKey() + mapValues() for non-commutative, non-associative aggregate?

2016-05-03 Thread Bibudh Lahiri
Hi, I have multiple procedure codes that a patient has undergone, in an RDD with a different row for each combination of patient and procedure. I am trying to convert this data to the LibSVM format, so that the result looks as follows: "0 1:1 2:0 3:1 29:1 30:1 32:1 110:1" where 1, 2, 3,
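
One possible alternative to groupByKey here, sketched under the assumption that the input is an RDD[(Int, Int)] of (patientId, procedureCode) pairs with 1-based codes, and that every patient gets the label 0:

    // aggregateByKey builds the per-patient code set without shipping full groups around.
    val codesPerPatient = patientProcedures.aggregateByKey(Set.empty[Int])(
      (set, code) => set + code,
      (a, b) => a ++ b)

    // Emit one LibSVM line per patient: "<label> <index>:1 <index>:1 ...".
    val libsvmLines = codesPerPatient.map { case (_, codes) =>
      "0 " + codes.toSeq.sorted.map(c => s"$c:1").mkString(" ")
    }
    libsvmLines.saveAsTextFile("hdfs:///data/procedures_libsvm")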

Re: Improving performance of a kafka spark streaming app

2016-05-03 Thread Colin Kincaid Williams
Thanks Cody, I can see that the partitions are well distributed... I'm now in the process of moving to the direct API. On Tue, May 3, 2016 at 6:51 PM, Cody Koeninger wrote: > 60 partitions in and of itself shouldn't be a big performance issue > (as long as producers are

Calculating log-loss for the trained model in Spark ML

2016-05-03 Thread Abhishek Anand
I am building a ML pipeline for logistic regression. val lr = new LogisticRegression() lr.setMaxIter(100).setRegParam(0.001) val pipeline = new Pipeline().setStages(Array(geoDimEncoder,clientTypeEncoder, devTypeDimIdEncoder,pubClientIdEncoder,tmpltIdEncoder,
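
One way to compute log-loss by hand once the pipeline is fitted, sketched assuming the model writes the usual "label" and "probability" columns, and that pipelineModel and testDF are the fitted model and a held-out DataFrame:

    import org.apache.spark.mllib.linalg.Vector

    val scored = pipelineModel.transform(testDF).select("label", "probability")

    val eps = 1e-15
    val logLoss = scored.map { row =>
      val y = row.getDouble(0)
      val p = math.max(eps, math.min(1 - eps, row.getAs[Vector](1)(1)))  // P(label = 1)
      -(y * math.log(p) + (1 - y) * math.log(1 - p))
    }.mean()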

Re: Error while running jar using spark-submit on another machine

2016-05-03 Thread nsalian
Thank you for the question. What is different on this machine as compared to the ones where the job succeeded? - Neelesh S. Salian Cloudera -- View this message in context:

Re: a question about --executor-cores

2016-05-03 Thread nsalian
Hello, Thank you for posting the question. To begin, I do have a few questions. 1) What is the size of the YARN installation? How many NodeManagers? 2) Notes to remember: Container Virtual CPU Cores yarn.nodemanager.resource.cpu-vcores >> Number of virtual CPU cores that can be allocated for

Re: Creating new Spark context when running in Secure YARN fails

2016-05-03 Thread nsalian
Feel free to correct me if I am wrong, but I believe this isn't a feature yet: "create a new Spark context within a single JVM process (driver)". A few questions for you: 1) Is Kerberos set up correctly for you (the user)? 2) Could you please add the command/code you are executing? Checking to

Free memory while launching jobs.

2016-05-03 Thread mjordan79
I have a machine with 8GB of total memory, on which there are other applications installed. The Spark application must run 1 driver and two jobs at a time. I have configured 8 cores in total. The machine (without Spark) has 4GB of free RAM (the other half of the RAM is used by other applications). So I

Re: Weird results with Spark SQL Outer joins

2016-05-03 Thread Kevin Peng
Mike, It looks like you are right. The results seem to be fine. It looks like I messed up on the filtering clause. sqlContext.sql("SELECT s.date AS edate , s.account AS s_acc , d.account AS d_acc , s.ad as s_ad , d.ad as d_ad , s.spend AS s_spend , d.spend_in_dollar AS d_spend FROM

Re: Improving performance of a kafka spark streaming app

2016-05-03 Thread Cody Koeninger
60 partitions in and of itself shouldn't be a big performance issue (as long as producers are distributing across partitions evenly). On Tue, May 3, 2016 at 1:44 PM, Colin Kincaid Williams wrote: > Thanks again Cody. Regarding the details 66 kafka partitions on 3 > kafka servers,

Re: Improving performance of a kafka spark streaming app

2016-05-03 Thread Colin Kincaid Williams
Thanks again Cody. Regarding the details: 66 Kafka partitions on 3 Kafka servers, likely 8-core systems with 10 disks each. Maybe the issue with the receiver was the large number of partitions. I had miscounted the disks, and so 11*3*2 is how I decided to partition my topic on insertion (by my

Re: removing header from csv file

2016-05-03 Thread Michael Segel
Hi, Another silly question… Don’t you want to use the header line to help create a schema for the RDD? Thx -Mike > On May 3, 2016, at 8:09 AM, Mathieu Longtin wrote: > > This only works if the files are "unsplittable". For example gzip files, each > partition is
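
For reference, a small sketch of letting spark-csv consume the header (and optionally infer types) rather than dropping it manually; the path is hypothetical:

    val df = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")        // first line supplies the column names
      .option("inferSchema", "true")   // sample the data to guess column types
      .load("hdfs:///data/input/*.csv")
    df.printSchema()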

Re: Weird results with Spark SQL Outer joins

2016-05-03 Thread Michael Segel
Silly question? If you change the predicate to ( s.date >= ‘2016-01-03’ OR s.date IS NULL ) AND (d.date >= ‘2016-01-03’ OR d.date IS NULL) What do you get? Sorry if the syntax isn’t 100% correct. The idea is to not drop null values from the query. I would imagine that this shouldn’t
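
A sketch of that null-preserving filter applied to the query quoted elsewhere in the thread, assuming the two DataFrames have been registered as temp tables swig and dps; the ON conditions are illustrative:

    val result = sqlContext.sql("""
      SELECT s.date AS edate, s.account AS s_acc, d.account AS d_acc,
             s.spend AS s_spend, d.spend_in_dollar AS d_spend
      FROM swig s
      FULL OUTER JOIN dps d
        ON s.account = d.account AND s.ad = d.ad AND s.date = d.date
      WHERE (s.date >= '2016-01-03' OR s.date IS NULL)
        AND (d.date >= '2016-01-03' OR d.date IS NULL)
    """)
    result.show()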

Spark 1.5.2 Shuffle Blocks - running out of memory

2016-05-03 Thread Nirav Patel
Hi, My Spark application is getting killed abruptly during a groupBy operation where a shuffle happens. All shuffles happen with PROCESS_LOCAL locality. I see the following in the driver logs. Shouldn't these logs be in the executors? Anyhow, it looks like ByteBuffer is running out of memory. What would be a workaround

Re: Weird results with Spark SQL Outer joins

2016-05-03 Thread Kevin Peng
Davies, What exactly do you mean in regards to Spark 2.0 turning these joins into inner joins? Does this mean that Spark SQL won't support where clauses in outer joins? Cesar & Gourav, When running the queries without the where clause it works as expected. I am pasting my results

Re: yarn-cluster

2016-05-03 Thread nsalian
Hello, Thank you for the question. The status UNDEFINED means the application has not completed and has not yet been assigned resources. Upon getting an assignment it will progress to RUNNING and then SUCCEEDED upon completion. It isn't a problem that you should worry about. You should make sure to tune your

Re: Weird results with Spark SQL Outer joins

2016-05-03 Thread Davies Liu
Bingo, the two predicates s.date >= '2016-01-03' AND d.date >= '2016-01-03' are the root cause: they filter out all the nulls from the outer join, giving the same result as an inner join. In Spark 2.0, we actually turn these joins into inner joins. On Tue, May 3, 2016 at 9:50 AM, Cesar Flores

Re: how to orderBy previous groupBy.count.orderBy in pyspark

2016-05-03 Thread webe3vt
Here is what I ended up doing. Improvements are welcome. from pyspark.sql import SQLContext, Row from pyspark.sql.types import StructType, StructField, IntegerType, StringType from pyspark.sql.functions import asc, desc, sum, count sqlContext = SQLContext(sc) error_schema = StructType([

Re: Weird results with Spark SQL Outer joins

2016-05-03 Thread Cesar Flores
Hi, Have you tried the joins without the where clause? When you use it you are filtering out all the rows with null columns in those fields. In other words you are doing an inner join in all your queries. On Tue, May 3, 2016 at 11:37 AM, Gourav Sengupta wrote: > Hi

Re: Weird results with Spark SQL Outer joins

2016-05-03 Thread Gourav Sengupta
Hi Kevin, Having given it a first look, I do think that you have hit something here and this does not look quite right. I have to work on the multiple AND conditions in ON and see whether that is causing any issues. Regards, Gourav Sengupta On Tue, May 3, 2016 at 8:28 AM, Kevin Peng

Re: Error from reading S3 in Scala

2016-05-03 Thread Gourav Sengupta
Hi, The best thing to do is start the EMR clusters with proper permissions in the roles; that way you do not need to worry about the keys at all. Another thing: why are we using s3a:// instead of s3:// ? Besides that you can increase S3 speeds using the instructions mentioned here:

--jars for mesos cluster

2016-05-03 Thread Alex Dzhagriev
Hello all, In the Mesos related spark docs ( http://spark.apache.org/docs/1.6.0/running-on-mesos.html#cluster-mode) I found this statement: Note that jars or python files that are passed to spark-submit should be > URIs reachable by Mesos slaves, as the Spark driver doesn’t automatically >

Re: Multiple Spark Applications that use Cassandra, how to share resources/nodes

2016-05-03 Thread Andy Davidson
Hi Tobias, I am very interested in implementing a REST-based API on top of Spark. My REST-based system would make predictions from data provided in the request using models trained in batch. My SLA is 250 ms. Would you mind sharing how you implemented your REST server? I am using spark-1.6.1. I have

Multiple Spark Applications that use Cassandra, how to share resources/nodes

2016-05-03 Thread Tobias Eriksson
Hi, We are using Spark for a long-running job; in fact it is a REST server that does some joins with some tables in Cassandra and returns the result. Now we need to have multiple applications running in the same Spark cluster, and from what I understand this is not possible, or should I say

Re: Spark streaming app starts processing when kill that app

2016-05-03 Thread Shams ul Haque
Hey Hareesh, Thanks for the help; they were starving. I increased the cores and memory on that machine. Now it is working fine. Thanks again On Tue, May 3, 2016 at 12:57 PM, Shams ul Haque wrote: > No, i made a cluster of 2 machines. And after submission to master, this >

unsubscribe

2016-05-03 Thread Rodrick Brown
unsubscribe \-- **Rodrick Brown** / Systems Engineer +1 917 445 6839 / [rodr...@orchardplatform.com](mailto:char...@orchardplatform.com) **Orchard Platform** 101 5th Avenue, 4th Floor, New York, NY 10003 [http://www.orchardplatform.com](http://www.orchardplatform.com/) [Orchard

Re: removing header from csv file

2016-05-03 Thread Mathieu Longtin
This only works if the files are "unsplittable". For example with gzip files, each partition is one file (if you have more partitions than files), so the first line of each partition is the header. The spark-csv extension reads the very first line of the RDD, assumes it's the header, and then filters

Re: parquet table in spark-sql

2016-05-03 Thread Sandeep Nemuri
We don't need any delimiters for the Parquet file format. On Tue, May 3, 2016 at 5:31 AM, Varadharajan Mukundan wrote: > Hi, > > Yes, it is not needed. Delimiters are needed only for text files. > > On Tue, May 3, 2016 at 12:49 PM, 喜之郎 <251922...@qq.com> wrote: > >> hi, I
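
A minimal DDL sketch (run through a HiveContext): no ROW FORMAT or delimiters, since Parquet carries its own schema. The table name and first column are hypothetical; total_duration comes from the original question:

    sqlContext.sql("""
      CREATE TABLE IF NOT EXISTS call_stats (
        caller STRING,
        total_duration BIGINT
      )
      STORED AS PARQUET
    """)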

[Spark 1.5.2] Spark dataframes vs sql query -performance parameter ?

2016-05-03 Thread Divya Gehlot
Hi, I am interested to know on which parameters we can say Spark DataFrames are better than SQL queries. I would be grateful if somebody could explain it to me with use cases. Thanks, Divya

Re: parquet table in spark-sql

2016-05-03 Thread Varadharajan Mukundan
Hi, Yes, it is not needed. Delimiters are needed only for text files. On Tue, May 3, 2016 at 12:49 PM, 喜之郎 <251922...@qq.com> wrote: > hi, I want to ask a question about parquet tables in spark-sql. > > I think that Parquet has schema information in its own file, > so you don't need to define

Re: Error from reading S3 in Scala

2016-05-03 Thread Steve Loughran
Don't put your secret in the URI, it'll only creep out in the logs. Use the specific properties covered in http://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html, which you can set in your Spark context by prefixing them with spark.hadoop. You can also set the env vars,
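
A small sketch of that advice, pulling placeholder credentials from environment variables and handing them to Hadoop through the spark.hadoop. prefix; the bucket and path are hypothetical:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("s3a-read")
      .set("spark.hadoop.fs.s3a.access.key", sys.env("AWS_ACCESS_KEY_ID"))
      .set("spark.hadoop.fs.s3a.secret.key", sys.env("AWS_SECRET_ACCESS_KEY"))
    val sc = new SparkContext(conf)

    // The URI itself now carries no credentials.
    val lines = sc.textFile("s3a://my-bucket/path/to/data/")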

Re: Reading from Amazon S3

2016-05-03 Thread Steve Loughran
On 2 May 2016, at 19:24, Gourav Sengupta wrote: Jorn, what aspects are you speaking about? My response was absolutely pertinent to Jinan because he would not even face the problem if he used Scala. So it was along the lines of

Re: Reading from Amazon S3

2016-05-03 Thread Steve Loughran
I'm going to start by letting you know two secret tools we use for diagnosing faults, one for big data at work, the other for a large RDBMS behind a web UI: 1. Google 2. The search field in Apache JIRA. Given this is a senior project, these foundational tools are something you are going to need to

parquet table in spark-sql

2016-05-03 Thread 喜之郎
hi, I want to ask a question about parquet tables in spark-sql. I think that Parquet has schema information in its own file, so you don't need to define a row separator and column separator in the CREATE TABLE DDL, like this: total_duration BigInt) ROW FORMAT DELIMITED FIELDS TERMINATED BY

Re: kafka direct streaming python API fromOffsets

2016-05-03 Thread Saisai Shao
I guess the problem is that py4j automatically translates the Python int into a Java int or long according to the value of the data. If the value is small it will translate it to a Java int, otherwise it will translate it into a Java long. But in the Java code, the parameter must be of long type, so that's the

Re: kafka direct streaming python API fromOffsets

2016-05-03 Thread Tigran Avanesov
Thank you, But now I have this error: java.lang.ClassCastException: java.lang.Integer cannot be cast to java.lang.Long My offsets are actually not big enough to be longs. If I put bigger values, I have no such exception. To me it looks like a bug. Any ideas for a workaround? Thanks! On

Re: Spark build failure with com.oracle:ojdbc6:jar:11.2.0.1.0

2016-05-03 Thread Mich Talebzadeh
Which version of Spark are you using? Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw * http://talebzadehmich.wordpress.com On 3 May 2016 at

Re: Weird results with Spark SQL Outer joins

2016-05-03 Thread Kevin Peng
Davies, Here is the code that I am typing into the spark-shell along with the results (my question is at the bottom): val dps = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").load("file:///home/ltu/dps_csv/") val swig =

Re: Spark streaming app starts processing when kill that app

2016-05-03 Thread Shams ul Haque
No, I made a cluster of 2 machines. And after submission to the master, this app moves to the slave machine for execution. Well, I am going to give your suggestion a try by running both on the same machine. Thanks Shams On Tue, May 3, 2016 at 12:53 PM, hareesh makam wrote: > If

Re: removing header from csv file

2016-05-03 Thread Abhishek Anand
You can use this function to remove the header from your dataset (applicable to RDDs): def dropHeader(data: RDD[String]): RDD[String] = { data.mapPartitionsWithIndex((idx, lines) => { if (idx == 0) lines.drop(1) else lines }) } Abhi On Wed, Apr 27, 2016 at
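
The same helper as a self-contained sketch, with the else branch spelled out so the advanced iterator is what gets returned for partition 0:

    import org.apache.spark.rdd.RDD

    def dropHeader(data: RDD[String]): RDD[String] =
      data.mapPartitionsWithIndex { (idx, lines) =>
        if (idx == 0) lines.drop(1) else lines  // only the first partition holds the header
      }

    // Usage on a hypothetical file:
    // val noHeader = dropHeader(sc.textFile("hdfs:///data/input.csv"))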

Re: Spark streaming app starts processing when kill that app

2016-05-03 Thread hareesh makam
If you are running your master on a single core, it might be an issue of starvation. Assuming you are running it locally, try setting the master to local[2] or higher. Check the first example at https://spark.apache.org/docs/latest/streaming-programming-guide.html - Hareesh On 3 May 2016 at 12:35,
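
For reference, the shape of that first example from the streaming guide, using at least two local cores so the receiver and the processing tasks don't starve each other:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
    val ssc = new StreamingContext(conf, Seconds(1))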

Clear Threshold in Logistic Regression ML Pipeline

2016-05-03 Thread Abhishek Anand
Hi All, I am trying to build a logistic regression pipeline in ML. How can I clear the threshold, which by default is 0.5? In MLlib I am able to clear the threshold to get the raw predictions using the model.clearThreshold() function. Regards, Abhi
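
One way to get at the un-thresholded scores in ML, sketched assuming a fitted pipelineModel and a test DataFrame: the logistic regression stage writes "rawPrediction" and "probability" columns alongside the thresholded "prediction", so you can read those directly:

    val scored = pipelineModel.transform(testDF)
      .select("features", "rawPrediction", "probability", "prediction")
    scored.show(5)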

Spark streaming app starts processing when kill that app

2016-05-03 Thread Shams ul Haque
Hi all, I am facing a strange issue when running a Spark Streaming app. When I submit my app by *spark-submit* it works fine and is also visible in the Spark UI, but it doesn't process any data coming from Kafka. And when I kill that app by pressing Ctrl + C on the terminal, then it starts

Submit job to spark cluster Error ErrorMonitor dropping message...

2016-05-03 Thread Tenghuan He
Hi, I deployed a Spark cluster with a master and a worker. The master and worker are each on a VMware virtual machine with 1G memory and 2 cores. Master IP: 192.168.179.133, worker IP: 192.168.179.134. After executing sbin/start-all.sh, the master and the worker start up; visit

Re: Weird results with Spark SQL Outer joins

2016-05-03 Thread Davies Liu
As @Gourav said, all the joins with different join types show the same results, which means that every row from the left could match at least one row from the right, and every row from the right could match at least one row from the left, even though the number of rows on the left does not equal that on the right. This is