[no subject]

2016-06-09 Thread pooja mehta
Hi, How do we use a Scala UDF from the Beeline client? From the spark shell we register our UDF like this: sqlContext.udf.register(). What is the way to use a UDF from the Beeline client? Thanks Pooja
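One common approach (not spelled out in this thread) is to package the function as a Hive-style UDF and register it through SQL from Beeline against the Spark Thrift Server. A minimal sketch, assuming a hypothetical class name and jar path:
```scala
// Hypothetical Hive-style UDF written in Scala; Hive resolves evaluate() by reflection.
package com.example.udfs

import org.apache.hadoop.hive.ql.exec.UDF

class ToUpper extends UDF {
  def evaluate(s: String): String = if (s == null) null else s.toUpperCase
}

// Then, from Beeline connected to the Spark Thrift Server:
//   ADD JAR /path/to/my-udfs.jar;
//   CREATE TEMPORARY FUNCTION to_upper AS 'com.example.udfs.ToUpper';
//   SELECT to_upper(name) FROM people;
```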

Catalyst optimizer cpu/Io cost

2016-06-09 Thread Srinivasan Hariharan02
Hi, How can I get a Spark SQL query's CPU and I/O cost after optimizing for the best logical plan? Is there any API to retrieve this information? Could anyone point me to the code where the CPU and I/O cost are actually computed in the Catalyst module? Regards, Srinivasan Hariharan +91-9940395830

Re: Saving Parquet files to S3

2016-06-09 Thread Takeshi Yamamuro
Hi, You'd be better off setting `parquet.block.size`. // maropu On Thu, Jun 9, 2016 at 7:48 AM, Daniel Siegmann wrote: > I don't believe there's any way to output files of a specific size. What > you can do is partition your data into a number of partitions such that the > amount of data they each
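Putting the two suggestions in this thread together, a minimal sketch might look like the following; the target sizes, partition count, and S3 path are illustrative assumptions, not values from the thread:
```scala
// Assumed: sc is the SparkContext and df is the DataFrame to be written.
val targetRowGroupBytes = 1024 * 1024 * 1024      // ~1 GB Parquet row groups
sc.hadoopConfiguration.setInt("parquet.block.size", targetRowGroupBytes)

// Each partition becomes roughly one output file, so aim for ~1 GB per partition.
val numPartitions = 200                            // estimate: total data size / 1 GB
df.repartition(numPartitions)
  .write
  .parquet("s3a://my-bucket/output/")              // hypothetical bucket/path
```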

Re: Error while using checkpointing . Spark streaming 1.5.2- DStream checkpointing has been enabled but the DStreams with their functions are not serialisable

2016-06-09 Thread sandesh deshmane
Hi, I changed the code like this but the error continues: myUnionRdd.repartition(sparkNumberOfSlaves).foreachRDD( new Function, Void>() { private static final long serialVersionUID = 1L; @Override public Void call(JavaPairRDD v1) throws Exception { Map localMap = v1.collectAsMap(); long currentTim

Re: Spark 2.0 Streaming and Event Time

2016-06-09 Thread Chang Lim
Yes, now I realize that. I did exchange emails with TD on this topic. The Microsoft presentation at Spark Summit (the "reorder" function) would be a good addition to Spark. Is this feature on the road map? On Thu, Jun 9, 2016 at 9:56 AM, Michael Armbrust wrote: > There is no special settin

Re: How to insert data into 2000 partitions(directories) of ORC/parquet at a time using Spark SQL?

2016-06-09 Thread Mich Talebzadeh
Can you provide a code snippet of how you are populating the target table from the temp table? HTH Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw *

SparkR : glm model

2016-06-09 Thread april_ZMQ
Hi all, I'm a student working on a data analysis project with SparkR. I found out that GLM (generalized linear model) only supports two types of distribution, "gaussian" and "binomial". However, our project requires the "poisson" distribution. Meanwhile, I found out that sparkR was

Error writing parquet to S3

2016-06-09 Thread Peter Halliday
I’m not 100% sure why I’m getting this. I don’t see any errors before this at all. I’m not sure how to diagnose this. Peter Halliday 2016-06-10 01:46:05,282] WARN org.apache.spark.scheduler.TaskSetManager [task-result-getter-2hread] - Lost task 3737.0 in stage 2.0 (TID 10585, ip-172-16-96

Re: pyspark.GroupedData.agg works incorrectly when one column is aggregated twice?

2016-06-09 Thread Davies Liu
This one works as expected:
```
>>> spark.range(10).selectExpr("id", "id as k").groupBy("k").agg({"k": "count", "id": "sum"}).show()
+---+--------+-------+
|  k|count(k)|sum(id)|
+---+--------+-------+
|  0|       1|      0|
|  7|       1|      7|
|  6|       1|      6|
|  9|       1|      9|
```

Re: How to insert data into 2000 partitions(directories) of ORC/parquet at a time using Spark SQL?

2016-06-09 Thread swetha kasireddy
No, I am reading the data from HDFS, transforming it, registering the data in a temp table using registerTempTable, and then doing an insert overwrite using Spark SQL's hiveContext. On Thu, Jun 9, 2016 at 3:40 PM, Mich Talebzadeh wrote: > how are you doing the insert? from an existing table? > > Dr
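A minimal sketch of the pattern being described (temp table plus dynamic-partition insert); the table and column names are illustrative, and the dynamic-partition settings are the standard Hive ones rather than anything specific to this thread:
```scala
// Assumed: hiveContext exists and df holds the transformed HDFS data.
hiveContext.setConf("hive.exec.dynamic.partition", "true")
hiveContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")
hiveContext.setConf("hive.exec.max.dynamic.partitions", "5000")  // above the 2000 needed here

df.registerTempTable("temp_events")
hiveContext.sql(
  """INSERT OVERWRITE TABLE events PARTITION (dt)
    |SELECT col1, col2, dt FROM temp_events""".stripMargin)
```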

Re: JIRA SPARK-2984

2016-06-09 Thread Sunil Kumar
Since I couldn't find a related PR, I have raised JIRA-15849. Thx On Thursday, June 9, 2016 11:07 AM, Holden Karau wrote: I'd do some searching and see if there is a JIRA related to this problem on s3 and if you don't find one go ahead and make one. Even if it is an intrinsic problem w

Re: How to insert data into 2000 partitions(directories) of ORC/parquet at a time using Spark SQL?

2016-06-09 Thread Mich Talebzadeh
how are you doing the insert? from an existing table? Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw * http://talebzadehmich.wordpress.com On

Re: How to insert data into 2000 partitions(directories) of ORC/parquet at a time using Spark SQL?

2016-06-09 Thread swetha kasireddy
400 cores are assigned to this job. On Thu, Jun 9, 2016 at 1:16 PM, Stephen Boesch wrote: > How many workers (/cpu cores) are assigned to this job? > > 2016-06-09 13:01 GMT-07:00 SRK : > >> Hi, >> >> How to insert data into 2000 partitions(directories) of ORC/parquet at a >> time using Spark SQ

Re: HIVE Query 25x faster than SPARK Query

2016-06-09 Thread Gavin Yue
Could you print out the SQL execution plan? My guess is that it is related to a broadcast join. > On Jun 9, 2016, at 07:14, Gourav Sengupta wrote: > > Hi, > > Query1 is almost 25x faster in HIVE than in SPARK. What is happening here and > is there a way we can optimize the queries in SPARK without the obviou

Re: HIVE Query 25x faster than SPARK Query

2016-06-09 Thread Mich Talebzadeh
I still fail to see how Hive can be orders of magnitude faster compared to Spark. Assuming that Hive is using map-reduce, I cannot see a real case for Hive to be faster, at least under normal operations. Don't take me wrong. I am a fan of Hive. The performance of Hive comes from deploying the

Issues when using the streaming checkpoint

2016-06-09 Thread Natu Lauchande
Hi, I am having the following error when using checkpointing in a Spark Streaming app: java.io.NotSerializableException: DStream checkpointing has been enabled but the DStreams with their functions are not serializable I am following the example available in https://github.com/apache/spark/blob/m

Re: HIVE Query 25x faster than SPARK Query

2016-06-09 Thread Gourav Sengupta
Hi Stephen, How can a single gzipped CSV file be partitioned, and who partitions tables based on primary key in Hive? If you read the environments section you will be able to see that all the required details are mentioned. As far as I understand, Hive does work 25x faster (in these particul

Re: How to insert data into 2000 partitions(directories) of ORC/parquet at a time using Spark SQL?

2016-06-09 Thread Stephen Boesch
How many workers (/cpu cores) are assigned to this job? 2016-06-09 13:01 GMT-07:00 SRK : > Hi, > > How to insert data into 2000 partitions(directories) of ORC/parquet at a > time using Spark SQL? It seems to be not performant when I try to insert > 2000 directories of Parquet/ORC using Spark SQL

how to store results of Scala Query in Text format or tab delimiter

2016-06-09 Thread Mahender Sarangam
Hi, We are newbies learning Spark. We are running a Scala query against our Parquet table. Whenever we fire a query in Jupyter, results are shown in the page, but only part of the results are shown in the UI. So we are trying to store the results into a table which is in Parquet format. By default, in Spark all the ta
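A minimal sketch of writing query results out as tab-delimited text; it assumes Spark 1.x with the spark-csv package on the classpath (in Spark 2.x the built-in csv source takes the same options), and the table and output paths are hypothetical:
```scala
// Assumed: sqlContext exists and "my_parquet_table" is the registered Parquet table.
val results = sqlContext.sql("SELECT * FROM my_parquet_table")

// Option 1: tab-delimited files via the spark-csv data source.
results.write
  .format("com.databricks.spark.csv")
  .option("delimiter", "\t")
  .option("header", "true")
  .save("/output/results_tsv")

// Option 2: plain text files without any extra package.
results.rdd
  .map(_.mkString("\t"))
  .saveAsTextFile("/output/results_txt")
```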

How to insert data into 2000 partitions(directories) of ORC/parquet at a time using Spark SQL?

2016-06-09 Thread SRK
Hi, How to insert data into 2000 partitions(directories) of ORC/parquet at a time using Spark SQL? It seems to be not performant when I try to insert 2000 directories of Parquet/ORC using Spark SQL. Did anyone face this issue? Thanks! -- View this message in context: http://apache-spark-user

Re: [ Standalone Spark Cluster ] - Track node status

2016-06-09 Thread Mich Talebzadeh
Hi Rutuja, I am not certain whether such a tool exists or not. However, opening a JIRA may be beneficial and would not do any harm. You may look for a workaround. Now my understanding is that your need is for monitoring the health of the cluster? HTH Dr Mich Talebzadeh LinkedIn * https://www.li

Re: [ Standalone Spark Cluster ] - Track node status

2016-06-09 Thread Rutuja Kulkarni
Thanks again Mich! If there is no existing interface like a REST API or CLI for this, I would like to open a JIRA on exposing such a REST interface in SPARK which would list all the worker nodes. Please let me know if this seems to be the right thing to do for the community. Regards, Rutuja Kul
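As a possible workaround while such an interface is discussed: in many Spark versions the standalone Master web UI already serves its state, including the worker list, as JSON. A minimal sketch, assuming the default UI port and a hypothetical master host; treat the endpoint as an assumption to verify against your version:
```scala
// Fetch the standalone Master's state; the JSON includes a "workers" array
// with host, port, state, cores and memory for each worker.
val masterJson = scala.io.Source.fromURL("http://spark-master:8080/json/").mkString
println(masterJson)
```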

Re: Spark Streaming heap space out of memory

2016-06-09 Thread christian.dancu...@rbc.com
Issue was resolved by upgrading Spark to version 1.6 -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-heap-space-out-of-memory-tp27050p27131.html Sent from the Apache Spark User List mailing list archive at Nabble.com. ---

Re: Processing Time Spikes (Spark Streaming)

2016-06-09 Thread christian.dancu...@rbc.com
What version of Spark are you running? Do you see the heap space slowly increase over time? Have you set the ttl cleaner? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Processing-Time-Spikes-Spark-Streaming-tp22375p27130.html Sent from the Apache Spark

Re: Variable in UpdateStateByKey Not Updating After Restarting Application from Checkpoint

2016-06-09 Thread Joe Panciera
Could it be possible that this is a bug? I hate to throw that word around, but this is definitely not expected behavior (as far as I can tell). If anyone has a suggestion for a workaround or a better way to accomplish handling a global value in UpdateStateByKey, that would be fantastic. Thanks On

Re: Error while using checkpointing . Spark streaming 1.5.2- DStream checkpointing has been enabled but the DStreams with their functions are not serialisable

2016-06-09 Thread Tathagata Das
myFunction() is probably capturing unexpected things in the closure of the Function you have defined, because myFunction is defined outside. Try defining myFunction inside the Function and see if the problem persists. On Thu, Jun 9, 2016 at 3:57 AM, sandesh deshmane wrote: > Hi, > > I am usi
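A minimal Scala sketch of the principle TD describes (the thread's code is Java, but the closure rule is the same): keep the helper local to the function passed into the DStream operation, so serialization does not drag in the enclosing class.
```scala
// Assumed: stream is a DStream[String]; the helper is illustrative.
stream.foreachRDD { rdd =>
  // Defined inside the closure, so only what it actually uses is serialized.
  def toKeyValue(line: String): (String, Int) = (line, 1)

  rdd.map(toKeyValue).take(10).foreach(println)
}
```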

Re: Spark Streaming getting slower

2016-06-09 Thread John Simon
Sorry, forgot to mention that I don't use broadcast variables. That's why I'm puzzled here. -- John Simon On Thu, Jun 9, 2016 at 11:09 AM, John Simon wrote: > Hi, > > I'm running Spark Streaming with Kafka Direct Stream, batch interval > is 10 seconds. > After running about 72 hours, the batch p

Spark Streaming getting slower

2016-06-09 Thread John Simon
Hi, I'm running Spark Streaming with Kafka Direct Stream, batch interval is 10 seconds. After running about 72 hours, the batch processing time almost doubles. I didn't find anything wrong on JVM GC logs, but I did find that broadcast variable reading time increasing, like this: initially: ``` 1

Re: JIRA SPARK-2984

2016-06-09 Thread Holden Karau
I'd do some searching and see if there is a JIRA related to this problem on s3 and if you don't find one go ahead and make one. Even if it is an intrinsic problem with s3 (and I'm not super sure since I'm just reading this on mobile) - it would maybe be a good thing for us to document. On Thursday

Re: JIRA SPARK-2984

2016-06-09 Thread Sunil Kumar
Holden, thanks for your prompt reply... Any suggestions on the next step? Does this call for a new Spark JIRA ticket, or is this an issue for S3? Thx

Re: JIRA SPARK-2984

2016-06-09 Thread Holden Karau
I think your error could possibly be different - looking at the original JIRA the issue was happening on HDFS and you seem to be experiencing the issue on s3n, and while I don't have full view of the problem I could see this being s3 specific (read-after-write on s3 is trickier than read-after-writ

Re: Spark ML - Is it safe to schedule two trainings job at the same time or will worker state be corrupted?

2016-06-09 Thread Jacek Laskowski
Hi, It's supposed to work like this - share the SparkContext to share datasets between threads. Ad 1. No. Ad 2. Yes. See CrossValidation and similar validations in spark.ml. Jacek On 9 Jun 2016 7:29 p.m., "Brandon White" wrote: > For example, say I want to train two Linear Regressions and two GBD T
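A minimal sketch of what concurrent training on one shared SparkContext can look like; the model types, dataset, and blocking wait are illustrative assumptions, not code from the thread:
```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration
import org.apache.spark.ml.regression.{GBTRegressor, LinearRegression}

// Assumed: trainingDF is a DataFrame with "features" and "label" columns.
val lrFuture  = Future { new LinearRegression().fit(trainingDF) }
val gbtFuture = Future { new GBTRegressor().fit(trainingDF) }

// Both jobs are submitted to the same SparkContext and scheduled concurrently.
val lrModel  = Await.result(lrFuture, Duration.Inf)
val gbtModel = Await.result(gbtFuture, Duration.Inf)
```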

Spark ML - Is it safe to schedule two trainings job at the same time or will worker state be corrupted?

2016-06-09 Thread Brandon White
For example, say I want to train two Linear Regressions and two GBD Tree Regressions. Using different threads, Spark allows you to submit jobs at the same time (see: http://spark.apache.org/docs/latest/job-scheduling.html). If I schedule two or more training jobs and they are running at the same t

Re: Spark 2.0 Streaming and Event Time

2016-06-09 Thread Michael Armbrust
There is no special setting for event time (though we will be adding one for setting a watermark in 2.1 to allow us to reduce the amount of state that needs to be kept around). Just window/groupBy on the column that is your event time. On Wed, Jun 8, 2016 at 4:12 PM, Chang Lim wrote: > H
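A minimal sketch of that window/groupBy pattern in Spark 2.0 structured streaming; the source, schema, and column names are illustrative assumptions:
```scala
import org.apache.spark.sql.functions.window
import spark.implicits._

// Assumed: eventSchema includes a timestamp column named "eventTime".
val events = spark.readStream
  .format("json")
  .schema(eventSchema)
  .load("/data/events")

// Group by 10-minute windows over the event-time column.
val counts = events
  .groupBy(window($"eventTime", "10 minutes"), $"userId")
  .count()
```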

Re: data frame or RDD for machine learning

2016-06-09 Thread Sandeep Nemuri
Please refer : http://spark.apache.org/docs/latest/mllib-guide.html ~Sandeep On Thursday 9 June 2016, Jacek Laskowski wrote: > Hi, > > Use DataFrame-based API (aka spark.ml) first and if your ml algorithm > doesn't support it switch to a RDD-based API (spark.mllib). What algorithm > are you goi

Re: Seq.toDF vs sc.parallelize.toDF = no Spark job vs one - why?

2016-06-09 Thread Jacek Laskowski
Makes sense. Thanks Michael (and welcome back from #SparkSummit!) On to exploring the space... Jacek On 9 Jun 2016 6:10 p.m., "Michael Armbrust" wrote: > Look at the explain(). For a Seq we know its just local data so avoid > spark jobs for simple operations. In contrast, an RDD is opaque to >

Re: HIVE Query 25x faster than SPARK Query

2016-06-09 Thread Stephen Boesch
ooc are the tables partitioned on a.pk and b.fk? Hive might be using copartitioning in that case: it is one of hive's strengths. 2016-06-09 7:28 GMT-07:00 Gourav Sengupta : > Hi Mich, > > does not Hive use map-reduce? I thought it to be so. And since I am > running the queries in EMR 4.6 therefo

Re: Seq.toDF vs sc.parallelize.toDF = no Spark job vs one - why?

2016-06-09 Thread Michael Armbrust
Look at the explain(). For a Seq we know it's just local data, so we avoid Spark jobs for simple operations. In contrast, an RDD is opaque to Catalyst, so we can't perform that optimization. On Wed, Jun 8, 2016 at 7:49 AM, Jacek Laskowski wrote: > Hi, > > I just noticed it today while toying with Sp
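A minimal sketch of the comparison Michael suggests; run each explain() and compare the physical plans:
```scala
import spark.implicits._   // sqlContext.implicits._ on 1.x

// Plan shows a local scan; simple actions can avoid launching a Spark job.
Seq(1, 2, 3).toDF("n").explain()

// Plan shows a scan over an existing RDD; actions need a Spark job.
sc.parallelize(Seq(1, 2, 3)).toDF("n").explain()
```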

JIRA SPARK-2984

2016-06-09 Thread Sunil Kumar
Hi, I am running into SPARK-2984 while running my Spark 1.6.1 jobs over YARN in AWS. I have tried with spark.speculation=false but still see the same failure with the _temporary file missing for task_xxx... This ticket is in resolved state. How can this be reopened? Is there a workaround? thanks

Re: Apache design patterns

2016-06-09 Thread Daniel Siegmann
On Tue, Jun 7, 2016 at 11:43 PM, Francois Le Roux wrote: > 1. Should I use dataframes to 'pull' the source data? If so, do I do > a groupby and order by as part of the SQL query? > Seems reasonable. If you use Scala you might want to define a case class and convert the data frame to a datase
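A minimal sketch of that suggestion (define a case class and convert the DataFrame to a typed Dataset); the field names and path are illustrative:
```scala
// Assumed: Spark 1.6+ with sqlContext available.
case class Reading(id: Long, ts: java.sql.Timestamp, value: Double)

import sqlContext.implicits._
val readings = sqlContext.read.parquet("/data/readings").as[Reading]  // Dataset[Reading]

readings.filter(_.value > 0.0).show()
```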

Re: Saving Parquet files to S3

2016-06-09 Thread Daniel Siegmann
I don't believe there's any way to output files of a specific size. What you can do is partition your data into a number of partitions such that the amount of data they each contain is around 1 GB. On Thu, Jun 9, 2016 at 7:51 AM, Ankur Jain wrote: > Hello Team, > > > > I want to write parquet fil

Re: Trainning a spark ml linear regresion model fail after migrating from 1.5.2 to 1.6.1

2016-06-09 Thread chaz2505
I ran into this problem too - it's because WeightedLeastSquares (added in 1.6.0, SPARK-10668) is being used on an ill-conditioned problem (SPARK-11918), I guess because of the one-hot encoding. To get around it you need to ensure WeightedLeastSquares isn't used. Set parameters to make the following
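A minimal sketch of that workaround as I understand it (the parameter values are assumptions, not the exact settings from the thread): pick a solver or regularization that steers LinearRegression away from the normal-equation (WeightedLeastSquares) path.
```scala
import org.apache.spark.ml.regression.LinearRegression

// Assumed: trainingDF has "features" and "label" columns.
val lr = new LinearRegression()
  .setSolver("l-bfgs")       // skip the "normal" (WeightedLeastSquares) solver
  .setElasticNetParam(0.5)   // L1 mixing also rules out the normal-equation path

val model = lr.fit(trainingDF)
```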

Re: OutOfMemory when doing joins in spark 2.0 while same code runs fine in spark 1.5.2

2016-06-09 Thread Ted Yu
bq. Read data from hbase using custom DefaultSource (implemented using TableScan) Did you use the DefaultSource from hbase-spark module in hbase master branch ? If you wrote your own, mind sharing related code ? Thanks On Thu, Jun 9, 2016 at 2:53 AM, raaggarw wrote: > Hi, > > I was trying to p

spark on mesos cluster - metrics with graphite sink

2016-06-09 Thread Lior Chaga
Hi, I'm launching spark application on mesos cluster. The namespace of the metric includes the framework id for driver metrics, and both framework id and executor id for executor metrics. These ids are obviously assigned by mesos, and they are not permanent - re-registering the application would re

Re: HIVE Query 25x faster than SPARK Query

2016-06-09 Thread Gourav Sengupta
Hi Mich, doesn't Hive use map-reduce? I thought it to be so. And since I am running the queries in EMR 4.6, HIVE is not using TEZ. Regards, Gourav On Thu, Jun 9, 2016 at 3:25 PM, Mich Talebzadeh wrote: > are you using map-reduce with Hive? > > Dr Mich Talebzadeh > > > > LinkedIn *

Re: HIVE Query 25x faster than SPARK Query

2016-06-09 Thread Mich Talebzadeh
are you using map-reduce with Hive? Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw * http://talebzadehmich.wordpress.com On 9 June 2016 at 15:

HIVE Query 25x faster than SPARK Query

2016-06-09 Thread Gourav Sengupta
Hi, Query1 is almost 25x faster in HIVE than in SPARK. What is happening here, and is there a way we can optimize the queries in SPARK without the obvious hack in Query2? --- ENVIRONMENT: --- > Table A has 533 columns x 24 million rows and Table B has 2 column

Re: HiveContext: Unable to load AWS credentials from any provider in the chain

2016-06-09 Thread Gourav Sengupta
Hi, are you using EC2 instances or a local cluster behind a firewall? Regards, Gourav Sengupta On Wed, Jun 8, 2016 at 4:34 PM, Daniel Haviv < daniel.ha...@veracity-group.com> wrote: > Hi, > > I'm trying to create a table on s3a but I keep hitting the following error: > > Exception in thread "main"

Saving Parquet files to S3

2016-06-09 Thread Ankur Jain
Hello Team, I want to write parquet files to AWS S3, but I want to size each file to 1 GB. Can someone please guide me on how I can achieve the same? I am using AWS EMR with spark 1.6.1. Thanks, Ankur

Spark ML Word2Vec Serialization Issues

2016-06-09 Thread sharad82
http://stackoverflow.com/questions/37723308/spark-ml-word2vec-serialization-issues I recently refactored our Word2Vec code to move to DataFrame-based ML models, but I am having problems serializing and loadin

Error while using checkpointing . Spark streaming 1.5.2- DStream checkpointing has been enabled but the DStreams with their functions are not serialisable

2016-06-09 Thread sandesh deshmane
Hi, I am using Spark Streaming to stream data from Kafka 0.8. I am using checkpointing in HDFS. I am getting an error like the one below: java.io.NotSerializableException: DStream checkpointing has been enabled but the DStreams with their functions are not serialisable field (class: org.apache.spark.st

OutOfMemory when doing joins in spark 2.0 while same code runs fine in spark 1.5.2

2016-06-09 Thread raaggarw
Hi, I was trying to port my code from spark 1.5.2 to spark 2.0, however I faced some OutOfMemory issues. On drilling down I could see that the OOM is because of the join, because removing the join fixes the issue. I then created a small spark-app to reproduce this: (48 cores, 300gb ram - divided among 4 worke

Re: HiveContext: Unable to load AWS credentials from any provider in the chain

2016-06-09 Thread Steve Loughran
On 9 Jun 2016, at 06:17, Daniel Haviv <daniel.ha...@veracity-group.com> wrote: Hi, I've set these properties both in core-site.xml and hdfs-site.xml with no luck. Thank you. Daniel That's not good. I'm afraid I don't know what version of s3a is in the cloudera release — I can see that

Fwd: Sequential computation over several partitions

2016-06-09 Thread Jeroen Miller
Hello, On Wed, Jun 8, 2016 at 12:59 AM, Mich Talebzadeh wrote: > > one thing you may consider is using something like flume to store > data on hfs. [...] Thank you for your sensible suggestions. > Have you thought of other tools besides Spark? No, at least not seriously yet. Flume looks like a

Re: Spark Streaming stateful operation to HBase

2016-06-09 Thread Jacek Laskowski
Hi, Check the number of records inside the DStream at a batch before you do the save. Gist the code with mapWithState and save? Jacek On 9 Jun 2016 7:58 a.m., "soumick dasgupta" wrote: Hi, I am using mapwithstate to keep the state and then ouput the result to HBase. The problem I am facing is

Re: Spark Partition by Columns doesn't work properly

2016-06-09 Thread Chanh Le
Ok, thanks. On Thu, Jun 9, 2016, 12:51 PM Jasleen Kaur wrote: > The github repo is https://github.com/datastax/spark-cassandra-connector > > The talk video and slides should be uploaded soon on spark summit website > > > On Wednesday, June 8, 2016, Chanh Le wrote: > >> Thanks, I'll look into it

Re: data frame or RDD for machine learning

2016-06-09 Thread Jacek Laskowski
Hi, Use the DataFrame-based API (aka spark.ml) first, and if your ML algorithm doesn't support it switch to the RDD-based API (spark.mllib). What algorithm are you going to use? Jacek On 9 Jun 2016 9:12 a.m., "pseudo oduesp" wrote: > Hi, > after spark 1.3 we have dataframe ( thanks good ) , instead
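To make the recommendation concrete, a minimal sketch of training with the DataFrame-based API; the input columns and algorithm choice are illustrative, not from the thread:
```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.VectorAssembler

// Assumed: rawDF has numeric columns "f1", "f2", "f3" and a "label" column.
val assembler = new VectorAssembler()
  .setInputCols(Array("f1", "f2", "f3"))
  .setOutputCol("features")

val prepared = assembler.transform(rawDF)
val model = new LogisticRegression().fit(prepared)   // spark.ml works on DataFrames
```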

data frame or RDD for machine learning

2016-06-09 Thread pseudo oduesp
Hi, after spark 1.3 we have DataFrames (thank God), instead of RDDs. In machine learning algorithms, should we give it an RDD or a DataFrame? I mean, when I build a model: Model = algorithm(rdd) or Model = algorithm(df)? If you have an example with a DataFrame, I prefer to work with