How to get the execution duration of a Spark SQL statement?

2017-02-06 Thread Mars Xu
Hello All, some Spark SQL statements will produce one or more jobs. I have 2 questions: 1. How is cc.sql(“sql statement”) divided into one or more jobs? 2. When I execute a Spark SQL query in the spark-shell client, how do I get the execution time (Spark 2.1.0)? if a sql query
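
A minimal sketch of timing a query in spark-shell, assuming Spark 2.1's spark.time helper and a placeholder table name (the original cc context is not shown, so the default spark session is used here):

    // spark-shell, Spark 2.1+: spark.time prints the wall-clock time of the enclosed action.
    spark.time(spark.sql("SELECT count(*) FROM some_table").show())

    // Or time it by hand around whatever action triggers the jobs.
    val t0 = System.nanoTime()
    spark.sql("SELECT count(*) FROM some_table").collect()
    println(s"Query took ${(System.nanoTime() - t0) / 1e9} s")

The Jobs and SQL tabs of the web UI also show how many jobs a given query produced and how long each took.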

Re: [Structured Streaming] Using File Sink to store to hive table.

2017-02-06 Thread Burak Yavuz
I presume you may be able to implement a custom sink and use df.saveAsTable. The problem is that you will have to handle idempotence / garbage collection yourself, in case your job fails while writing, etc. On Mon, Feb 6, 2017 at 5:53 PM, Egor Pahomov wrote: > I have
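
A rough sketch of the kind of custom sink being suggested, using Spark 2.x's internal org.apache.spark.sql.execution.streaming.Sink trait; the class name, table name, and the batch-id bookkeeping are illustrative assumptions, not a complete or fault-tolerant implementation:

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.execution.streaming.Sink

    // Hypothetical sink that appends each micro-batch to a Hive table via saveAsTable.
    class HiveAppendSink(table: String) extends Sink {
      @volatile private var lastCommittedBatch = -1L

      override def addBatch(batchId: Long, data: DataFrame): Unit = {
        // Skip batch ids we have already written so a retried batch is not duplicated.
        if (batchId > lastCommittedBatch) {
          // The incoming DataFrame is tied to the streaming query's incremental plan;
          // rebuilding it from the underlying rows is a common workaround before writing.
          val batch = data.sparkSession.createDataFrame(data.rdd, data.schema)
          batch.write.mode("append").saveAsTable(table)
          lastCommittedBatch = batchId
        }
      }
    }

Wiring this into writeStream would additionally need a StreamSinkProvider registered under a custom format name, and the last-committed batch id would have to be persisted somewhere to survive driver restarts.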

Re: [Structured Streaming] Using File Sink to store to hive table.

2017-02-06 Thread Egor Pahomov
I have a stream of files on HDFS with JSON events. I need to convert it to Parquet in real time, process some fields, and store it in a simple Hive table so people can query it. People might even want to query it with Impala, so it's important that it be a real Hive-metastore-based table. How can I do
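
One possible sketch of that pipeline with the built-in file sink, assuming a fixed event schema and hypothetical HDFS paths; the Hive side would just be an external table pointed at the sink's output directory:

    import org.apache.spark.sql.functions.to_date
    import org.apache.spark.sql.types._
    import spark.implicits._

    // Assumed schema of the JSON events.
    val eventSchema = new StructType()
      .add("event_time", TimestampType)
      .add("user_id", StringType)
      .add("payload", StringType)

    val events = spark.readStream
      .schema(eventSchema)            // streaming file sources need an explicit schema
      .json("hdfs:///data/incoming/json")

    val query = events
      .withColumn("dt", to_date($"event_time"))
      .writeStream
      .format("parquet")
      .option("path", "hdfs:///warehouse/events_pq")
      .option("checkpointLocation", "hdfs:///checkpoints/events_pq")
      .partitionBy("dt")
      .start()

The output directory can then back an external, partitioned Hive table (CREATE EXTERNAL TABLE ... STORED AS PARQUET LOCATION ...), though, as the replies in this thread note, the file sink keeps its own metadata log (_spark_metadata), so external readers such as Impala see the raw files rather than that log.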

Re: wholeTextFiles fails, but textFile succeeds for same path

2017-02-06 Thread Jon Gregg
Strange that it's working for some directories but not others. Looks like wholeTextFiles maybe doesn't work with S3? https://issues.apache.org/jira/browse/SPARK-4414 . If it's possible to load the data into EMR and run Spark from there, that may be a workaround. This blogspot shows a python

Re: wholeTextFiles fails, but textFile succeeds for same path

2017-02-06 Thread Paul Tremblay
I've actually been able to trace the problem to the files being read in. If I change to a different directory, then I don't get the error. Is one of the executors running out of memory? On 02/06/2017 02:35 PM, Paul Tremblay wrote: When I try to create an rdd using wholeTextFiles, I get an

wholeTextFiles fails, but textFile succeeds for same path

2017-02-06 Thread Paul Tremblay
When I try to create an rdd using wholeTextFiles, I get an incomprehensible error. But when I use the same path with sc.textFile, I get no error. I am using pyspark with spark 2.1. in_path = 's3://commoncrawl/crawl-data/CC-MAIN-2016-50/segments/1480698542939.6/warc/ rdd =
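
For reference, the two calls differ in both shape and memory behaviour; a small Scala sketch with a placeholder path (the original S3 path is truncated above):

    // textFile: one record per line, spread across many tasks.
    val lines = sc.textFile("s3://bucket/some/prefix/")

    // wholeTextFiles: one (path, fullContents) pair per file; each file is read
    // in a single task, so very large files can exhaust executor memory.
    val files = sc.wholeTextFiles("s3://bucket/some/prefix/")

    lines.take(1).foreach(println)
    files.keys.take(1).foreach(println)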

Re: [Structured Streaming] Using File Sink to store to hive table.

2017-02-06 Thread Burak Yavuz
Hi Egor, Structured Streaming handles all of its metadata itself, which files are actually valid, etc. You may use the "create table" syntax in SQL to treat it like a hive table, but it will handle all partitioning information in its own metadata log. Is there a specific reason that you want to

Re: Spark streaming: Could not initialize class kafka.consumer.FetchRequestAndResponseStatsRegistry$

2017-02-06 Thread Marco Mistroni
My bad! I confused myself with different build.sbt files I tried in different projects. Thx Cody for pointing that out (again). Spark streaming Kafka was all I needed. Kr On 6 Feb 2017 9:02 pm, "Cody Koeninger" wrote: > You should not need to include jars for Kafka, the spark connectors

Re: Spark streaming: Could not initialize class kafka.consumer.FetchRequestAndResponseStatsRegistry$

2017-02-06 Thread Cody Koeninger
You should not need to include jars for Kafka; the Spark connectors have the appropriate transitive dependency on the correct version. On Sat, Feb 4, 2017 at 3:25 PM, Marco Mistroni wrote: > Hi > not sure if this will help at all, and pls take it with a pinch of salt as > i
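
A minimal build.sbt sketch of what this looks like, assuming Spark 2.1.0 and the Kafka 0.10 integration; the connector pulls in the matching Kafka client transitively, so no separate Kafka jars are listed:

    // build.sbt (sketch)
    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-streaming"            % "2.1.0" % "provided",
      "org.apache.spark" %% "spark-streaming-kafka-0-10" % "2.1.0"
      // No explicit org.apache.kafka dependency: the connector brings its own kafka-clients.
    )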

Re: Cannot read Hive Views in Spark SQL

2017-02-06 Thread KhajaAsmath Mohammed
We are on Spark 1.6.1 under CDH 5.7.1. On Mon, Feb 6, 2017 at 2:53 PM, Xiao Li wrote: > Which Spark version are you using? > > 2017-02-06 12:25 GMT-05:00 vaquar khan : > >> Did you try MSCK REPAIR TABLE ? >> >> Regards, >> Vaquar Khan >>

Re: Cannot read Hive Views in Spark SQL

2017-02-06 Thread Xiao Li
Which Spark version are you using? 2017-02-06 12:25 GMT-05:00 vaquar khan : > Did you try MSCK REPAIR TABLE ? > > Regards, > Vaquar Khan > > On Feb 6, 2017 11:21 AM, "KhajaAsmath Mohammed" > wrote: > >> I dont think so, i was able to insert

Re: Cannot read Hive Views in Spark SQL

2017-02-06 Thread KhajaAsmath Mohammed
I tried the below in spark-shell and with DataFrames. None of them worked. I can access the same view in HUE. scala> hiveObj.refreshTable("dtmlab.vehscan_jackwagon_xml_mart_view") scala> val sample = sqlContext.sql("select * from dtmlab.vehscan_jackwagon_xml_mart_view").collect()
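
One quick way to narrow this down (a sketch assuming Spark 1.6 APIs and the same database name) is to check whether the view is visible to Spark's Hive catalog at all before querying it:

    // List what Spark's catalog can see in that database.
    sqlContext.tableNames("dtmlab").foreach(println)

    // Or via SQL:
    sqlContext.sql("SHOW TABLES IN dtmlab").show(false)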

Spark mapPartition output object size coming larger than expected

2017-02-06 Thread nitinkak001
I am storing the output of mapPartitions in a ListBuffer and exposing its iterator as the output. The output is a list of Long tuples (Tuple2). When I check the size of the object using Spark's SizeEstimator.estimate method, it comes out to 80 bytes per record/tuple object (calculating this by "size
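
A small sketch of the pattern being described (assumed structure, not the original code), showing the usual way to avoid materializing the whole partition in a ListBuffer, plus the SizeEstimator call used for the measurement:

    import org.apache.spark.util.SizeEstimator
    import scala.collection.mutable.ListBuffer

    val nums = sc.parallelize(1L to 1000000L)

    // Materializing into a ListBuffer keeps the whole partition's output in memory at once.
    val eager = nums.mapPartitions { it =>
      val buf = ListBuffer[(Long, Long)]()
      it.foreach(x => buf += ((x, x * 2)))
      buf.iterator
    }

    // Returning a mapped iterator keeps the output lazy and avoids the extra buffer.
    val lazyOut = nums.mapPartitions(it => it.map(x => (x, x * 2)))

    // SizeEstimator measures the JVM object graph, so a Tuple2 of boxed Longs is much
    // larger than 16 bytes of raw data (object headers, references, boxing).
    println(SizeEstimator.estimate((1L, 2L)))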

[Structured Streaming] Using File Sink to store to hive table.

2017-02-06 Thread Egor Pahomov
Hi, I'm thinking of using Structured Streaming instead of the old streaming, but I need to be able to save results to a Hive table. The documentation for the file sink says (http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#output-sinks): "Supports writes to partitioned tables."

Re: Spark 2 - Creating datasets from dataframes with extra columns

2017-02-06 Thread Don Drake
This seems like a bug to me; the schemas should match. scala> import org.apache.spark.sql.Encoders import org.apache.spark.sql.Encoders scala> val fEncoder = Encoders.product[F] fEncoder: org.apache.spark.sql.Encoder[F] = class[f1[0]: string, f2[0]: string, f3[0]: string] scala> fEncoder.schema
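
For context, a minimal reconstruction of the kind of comparison being made; the case class F here is inferred from the encoder output shown, not taken from the original post:

    import org.apache.spark.sql.Encoders

    case class F(f1: String, f2: String, f3: String)

    val fEncoder = Encoders.product[F]

    // The encoder's schema is what the Dataset conversion is expected to line up with...
    fEncoder.schema.printTreeString()

    // ...so one workaround is to narrow the DataFrame before converting:
    // df.select("f1", "f2", "f3").as[F]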

Re: Cannot read Hive Views in Spark SQL

2017-02-06 Thread vaquar khan
Did you try MSCK REPAIR TABLE ? Regards, Vaquar Khan On Feb 6, 2017 11:21 AM, "KhajaAsmath Mohammed" wrote: > I dont think so, i was able to insert overwrite other created tables in > hive using spark sql. The only problem I am facing is, spark is not able > to

Re: Cannot read Hive Views in Spark SQL

2017-02-06 Thread KhajaAsmath Mohammed
I don't think so; I was able to insert overwrite into other tables created in Hive using Spark SQL. The only problem I am facing is that Spark is not able to recognize the Hive view name. Very strange, but I'm not sure what I am doing wrong here. On Mon, Feb 6, 2017 at 11:03 AM, Jon Gregg

Re: Cannot read Hive Views in Spark SQL

2017-02-06 Thread Jon Gregg
Confirming that Spark can read newly created views - I just created a test view in HDFS and I was able to query it in Spark 1.5 immediately after without a refresh. Possibly an issue with your Spark-Hive connection? Jon On Sun, Feb 5, 2017 at 9:31 PM, KhajaAsmath Mohammed <

Re: Spark: Scala Shell Very Slow (Unresponsive)

2017-02-06 Thread Irving Duran
I only experience this the first time I install a new Spark version. After that, it flows smoothly. My question is (since you say it's your server): I assume that you are connecting remotely, so do you experience the same latency when invoking remote commands? If so, then it might be

PCA slow in comparison with single-threaded R version

2017-02-06 Thread Marek Wiewiorka
Hi All, I hit performance issues when running PCA for a matrix with a large number of features (2.5k x 15k): import org.apache.spark.mllib.linalg.Matrix import org.apache.spark.mllib.linalg.distributed.RowMatrix import org.apache.spark.mllib.linalg.DenseVector import
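
For reference, the usual mllib call looks roughly like this (a sketch with assumed data); note that RowMatrix.computePrincipalComponents assembles the covariance matrix and decomposes it locally on the driver, which for a 15k-column matrix can easily behave like a single-threaded computation:

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.linalg.distributed.RowMatrix

    // Assumed: an RDD[Vector] with ~2.5k rows of 15k features each (small random stand-in here).
    val rows = sc.parallelize(Seq.fill(100)(Vectors.dense(Array.fill(50)(math.random))))
    val mat  = new RowMatrix(rows)

    // The d x d covariance matrix is built and decomposed on the driver.
    val pc        = mat.computePrincipalComponents(10)   // k = 10 principal components
    val projected = mat.multiply(pc)                     // project the rows onto the components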

Re: [ML] MLeap: Deploy Spark ML Pipelines w/o SparkContext

2017-02-06 Thread Hollin Wilkins
Hi All - We got a number of great questions and ended up adding responses to them on the MLeap Documentation page, in the FAQ section. We're also including a "condensed" version at the bottom of this email. We appreciate the interest and the discussion

Re: How to specify "verbose GC" in Spark submit?

2017-02-06 Thread Md. Rezaul Karim
Thanks, Bryan. Got your point. Regards, _ *Md. Rezaul Karim*, BSc, MSc PhD Researcher, INSIGHT Centre for Data Analytics National University of Ireland, Galway IDA Business Park, Dangan, Galway, Ireland Web: http://www.reza-analytics.eu/index.html

Re: How to specify "verbose GC" in Spark submit?

2017-02-06 Thread Bryan Jeffrey
Hello. When specifying GC options for Spark you must determine where you want the GC options applied: on the executors or on the driver. When you submit your job, specify for the driver '--driver-java-options "-XX:+PrintFlagsFinal -verbose:gc"', etc. For the executors, specify --conf
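
A sketch of what that looks like on the command line (class name, jar, and paths are placeholders):

    spark-submit \
      --class com.example.MyApp \
      --driver-java-options "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps" \
      --conf "spark.executor.extraJavaOptions=-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps" \
      my-app.jar

The executor GC output then appears in each executor's stdout log rather than in the driver console.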

How to specify "verbose GC" in Spark submit?

2017-02-06 Thread Md. Rezaul Karim
Dear All, Is there any way to specify verbose GC, i.e. “-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps”, in Spark submit? Regards, _ *Md. Rezaul Karim*, BSc, MSc PhD Researcher, INSIGHT Centre for Data Analytics National University of Ireland, Galway IDA

Re: using an alternative slf4j implementation

2017-02-06 Thread Steve Loughran
> On 6 Feb 2017, at 11:06, Mendelson, Assaf wrote: > > Found some questions (without answers) and I found some jira > (https://issues.apache.org/jira/browse/SPARK-4147 and > https://issues.apache.org/jira/browse/SPARK-14703), however they do not solve > the issue. >

RE: using an alternative slf4j implementation

2017-02-06 Thread Mendelson, Assaf
Found some questions (without answers) and some JIRAs (https://issues.apache.org/jira/browse/SPARK-4147 and https://issues.apache.org/jira/browse/SPARK-14703), however they do not solve the issue. Nominally, a library should not explicitly set a binding; however, Spark does so (I

Re: specifing schema on dataframe

2017-02-06 Thread Sam Elamin
Ah ok, thanks for clearing it up Ayan! I will give that a go. Thank you all for your help, this mailing list is awesome! On Mon, Feb 6, 2017 at 9:07 AM, ayan guha wrote: > If I am not missing anything here, "So I know which columns are numeric > and which arent because I

Re: specifing schema on dataframe

2017-02-06 Thread ayan guha
If I am not missing anything here, "So I know which columns are numeric and which arent because I have a StructType and all the internal StructFields will tell me which ones have a DataType which is numeric and which arent" will lead to a list of fields which should be numeric.

Re: specifing schema on dataframe

2017-02-06 Thread Sam Elamin
Yup, sorry, I should have explained myself better. So I know which columns are numeric and which aren't because I have a StructType, and all the internal StructFields will tell me which ones have a DataType which is numeric and which aren't. So assuming I have a JSON string which has double quotes on
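
A small Scala sketch of that derivation (rawDf and expectedSchema are assumed names): filter the StructFields by NumericType, then cast the matching columns back after reading the JSON values as strings:

    import org.apache.spark.sql.types.NumericType
    import org.apache.spark.sql.functions.col

    // expectedSchema is the StructType already in hand; rawDf is the DataFrame read from JSON.
    val numericCols = expectedSchema.fields
      .filter(_.dataType.isInstanceOf[NumericType])
      .map(f => (f.name, f.dataType))

    // Cast each numeric column from its string form to the declared type.
    val typedDf = numericCols.foldLeft(rawDf) { case (d, (name, dt)) =>
      d.withColumn(name, col(name).cast(dt))
    }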

Re: specifing schema on dataframe

2017-02-06 Thread ayan guha
Umm, I think the premise is you need to "know" beforehand which columns are numeric. Unless you know it, how would you apply the schema? On Mon, Feb 6, 2017 at 7:54 PM, Sam Elamin wrote: > Thanks ayan but I meant how to derive the list automatically > > In your

Re: specifing schema on dataframe

2017-02-06 Thread Sam Elamin
Thanks Ayan, but I meant how to derive the list automatically. In your example you are specifying the numeric columns, and I would like it to be applied to any schema, if that makes sense. On Mon, 6 Feb 2017 at 08:49, ayan guha wrote: > SImple (pyspark) example: > > >>> df =

Re: specifing schema on dataframe

2017-02-06 Thread ayan guha
SImple (pyspark) example: >>> df = sqlContext.read.json("/user/l_aguha/spark_qs.json") >>> df.printSchema() root |-- customerid: string (nullable = true) |-- foo: string (nullable = true) >>> numeric_field_list = ['customerid'] >>> for k in numeric_field_list: ... df =

Re: using an alternative slf4j implementation

2017-02-06 Thread Jacek Laskowski
Hi, sounds like quite an involved development to me. I can't help here. I'd suggest going through the dev and user mailing lists for the past year and the JIRA issues regarding this, as I vaguely remember some discussions about logging in Spark (that would merit doing the migration to logback

RE: using an alternative slf4j implementation

2017-02-06 Thread Mendelson, Assaf
Shading doesn’t help (we already shaded everything). According to https://www.slf4j.org/codes.html#multiple_bindings only one binding can be used. The problem is that once we link to the Spark jars, we automatically inherit Spark’s binding (for log4j). I would like to find a way to either send
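
One thing that sometimes helps at dependency-resolution time (a sketch, not a guaranteed fix for the situation described) is excluding Spark's slf4j-log4j12 binding so that only the application's binding, e.g. logback-classic, ends up in the application's own classpath:

    // build.sbt (sketch): keep Spark but drop its slf4j -> log4j binding.
    libraryDependencies ++= Seq(
      ("org.apache.spark" %% "spark-sql" % "2.1.0" % "provided")
        .exclude("org.slf4j", "slf4j-log4j12"),
      "ch.qos.logback" % "logback-classic" % "1.1.7"
    )

When running under spark-submit, Spark's own classpath may still carry the log4j binding, so this mainly helps for embedded or local use.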

Re: [ML] MLeap: Deploy Spark ML Pipelines w/o SparkContext

2017-02-06 Thread Aseem Bansal
I agree with you that this is needed. There is a JIRA https://issues.apache.org/jira/browse/SPARK-10413 On Sun, Feb 5, 2017 at 11:21 PM, Debasish Das wrote: > Hi Aseem, > > Due to production deploy, we did not upgrade to 2.0 but that's critical > item on our list. > >