Re: [StructuredStreaming] multiple queries of the socket source: only one query works.

2017-08-11 Thread Rick Moritz
Hi Gerard, hi List, I think what this would entail is for Source.commit to change its functionality. You would need to track all streams' offsets there. Especially in the socket source, you already have a cache (haven't looked at Kafka's implementation too closely yet), so that shouldn't be the
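
For context, the change being discussed would make commit aware of which query is committing, so the source only evicts cached rows once every running query has consumed them. A rough sketch of that bookkeeping follows; the queryId parameter is hypothetical (Spark 2.x's Source.commit(end: Offset) takes only the end offset, which is exactly why only one query works today):

    import scala.collection.mutable

    // Hypothetical tracker for per-query committed offsets; a sketch of
    // the idea, not Spark's actual Source API.
    class MultiQueryCommitTracker {
      private val committed = mutable.Map.empty[String, Long]

      // queryId is an assumed addition to the commit contract.
      def commit(queryId: String, end: Long): Unit = synchronized {
        committed(queryId) = end
      }

      // Rows at or below this offset have been consumed by every
      // registered query and are safe to evict from the source's cache.
      def globalLowWatermark: Option[Long] = synchronized {
        if (committed.isEmpty) None else Some(committed.values.min)
      }
    }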

Re: Does Spark SQL use Calcite?

2017-08-11 Thread kant kodali
@Ryan it looks like if I enable the Thrift server I need to go through Hive. I was talking more about having a JDBC connector for Spark SQL itself, in other words not going through Hive. On Fri, Aug 11, 2017 at 6:50 PM, kant kodali wrote: > @Ryan Does it work with Spark SQL 2.1.1? > >
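
For readers following the thread: the Thrift server exposes Spark SQL over the HiveServer2 wire protocol, so any JDBC client can query Spark SQL through it with the standard Hive JDBC driver. A minimal sketch, assuming a Thrift server on localhost:10000 and placeholder credentials:

    import java.sql.DriverManager

    object SparkSqlJdbcExample {
      def main(args: Array[String]): Unit = {
        // The Spark Thrift Server speaks the HiveServer2 protocol,
        // so the plain Hive JDBC driver connects to it.
        Class.forName("org.apache.hive.jdbc.HiveDriver")
        val conn = DriverManager.getConnection(
          "jdbc:hive2://localhost:10000/default", "user", "")
        try {
          val rs = conn.createStatement().executeQuery("SELECT 1")
          while (rs.next()) println(rs.getInt(1))
        } finally conn.close()
      }
    }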

Re: Does Spark SQL use Calcite?

2017-08-11 Thread kant kodali
@Ryan Does it work with Spark SQL 2.1.1? On Fri, Aug 11, 2017 at 12:53 AM, Ryan wrote: > the thrift server is a jdbc server, Kanth > > On Fri, Aug 11, 2017 at 2:51 PM, wrote: > >> I also wonder why there isn't a jdbc connector for spark sql? >> >>

[StructuredStreaming] multiple queries of the socket source: only one query works.

2017-08-11 Thread Gerard Maas
Hi, I've been investigating this SO question: https://stackoverflow.com/questions/45618489/executing-separate-streaming-queries-in-spark-structured-streaming TL;DR: when using the socket source, trying to create multiple queries does not work properly; only the first query in the start order
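
The setup from the question boils down to the following sketch: two queries started against one socket DataFrame, where only the first one receives rows. Host and port are placeholders, and it assumes a process such as netcat serving text on that port:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().master("local[*]").getOrCreate()

    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", "9999")
      .load()

    // Two queries over the same socket source; per the report,
    // only the first query started actually receives data.
    val q1 = lines.writeStream.format("console").queryName("q1").start()
    val q2 = lines.writeStream.format("console").queryName("q2").start()

    spark.streams.awaitAnyTermination()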

Structured Streaming: multiple sinks

2017-08-11 Thread aravias
1) We are consuming from Kafka using structured streaming and writing the processed data set to S3. We also want to write the processed data to Kafka going forward; is it possible to do that from the same streaming query? (Spark version 2.1.1) 2) In the logs, I see the streaming query
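
The usual pattern for multiple sinks is one streaming query per sink, both built from the same source DataFrame, each with its own checkpoint location. A sketch assuming an active SparkSession spark, with placeholder brokers, topics, and paths; note the built-in kafka sink only arrived in Spark 2.2, so on 2.1.1 a custom ForeachWriter would be needed for the Kafka leg:

    val input = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "events")
      .load()

    // Sink 1: processed data to S3, with its own checkpoint.
    val toS3 = input.writeStream
      .format("parquet")
      .option("path", "s3a://bucket/processed/")
      .option("checkpointLocation", "s3a://bucket/checkpoints/s3-sink/")
      .start()

    // Sink 2: the same data back to Kafka (Spark 2.2+ syntax).
    val toKafka = input.selectExpr("CAST(value AS STRING) AS value")
      .writeStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("topic", "processed-events")
      .option("checkpointLocation", "s3a://bucket/checkpoints/kafka-sink/")
      .start()

    spark.streams.awaitAnyTermination()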

Re: spark.write.csv is not able to write files to the specified path, but is writing to an unintended _temporary/0/task_xxx subfolder on worker nodes

2017-08-11 Thread Sathish Kumaran Vairavelu
I think you can collect the results on the driver through the toLocalIterator method of RDD and save the result from the driver program, rather than writing it to files on the local disk and collecting them separately. If your data is small enough and you have enough cores/memory, try processing
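
A minimal sketch of that approach, assuming a DataFrame df and a placeholder output path; toLocalIterator pulls one partition at a time to the driver, which then writes a single local file:

    import java.io.PrintWriter

    val out = new PrintWriter("/tmp/result.csv")
    try {
      // Streams partitions to the driver one at a time instead of
      // materializing the whole result set at once (as collect would).
      df.rdd.toLocalIterator.foreach(row => out.println(row.mkString(",")))
    } finally out.close()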

Re: SQL specific documentation for recent Spark releases

2017-08-11 Thread Reynold Xin
This PR should help you in the next release: https://github.com/apache/spark/pull/18702 On Thu, Aug 10, 2017 at 7:46 PM, Stephen Boesch wrote: > > The correct link is https://docs.databricks.com/spark/latest/spark-sql/index.html . > > This link does have the core syntax

Re: spark.write.csv is not able to write files to the specified path, but is writing to an unintended _temporary/0/task_xxx subfolder on worker nodes

2017-08-11 Thread Steve Loughran
On 10 Aug 2017, at 09:51, Hemanth Gudela wrote: Yeah, installing HDFS in our environment is unfortunately going to take a lot of time (approvals/planning etc). I will have to live with the local FS for now. The other option I had

Re: Write only one output file in Spark SQL

2017-08-11 Thread Chetan Khatri
What you can do is create a partitioned column at the Hive table, for example date, and use val finalDf = dataframe.repartition(dataframe.col("date-column")), and later say insert overwrite tablename partition(date-column) select * from tempview. That would work as expected. On 11-Aug-2017 11:03 PM, "KhajaAsmath Mohammed"
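
A sketch of that suggestion, assuming a DataFrame df, a SparkSession spark with Hive support, and placeholder table/column names; repartitioning on the partition column routes each partition's rows to one task, so each Hive partition is written by a single writer:

    val finalDf = df.repartition(df.col("date_column"))
    finalDf.createOrReplaceTempView("tempview")

    // Commonly required for dynamic-partition inserts with Hive support.
    spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")

    // The partition column must be the last column of the select.
    spark.sql(
      """INSERT OVERWRITE TABLE mydb.mytable PARTITION (date_column)
        |SELECT * FROM tempview""".stripMargin)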

Re: Write only one output file in Spark SQL

2017-08-11 Thread KhajaAsmath Mohammed
We had spark.sql.shuffle.partitions set to 4, but in HDFS it ends up with 200 files; 4 files actually have data and the rest of them are zero bytes. My only requirement is for the Hive insert overwrite query from the Spark temporary table to run fast and to end up with fewer files instead of more files

Re: Write only one output file in Spark SQL

2017-08-11 Thread Lukas Bradley
Please show the write() call, and the results in HDFS. What are all the files you see? On Fri, Aug 11, 2017 at 1:10 PM, KhajaAsmath Mohammed < mdkhajaasm...@gmail.com> wrote: > tempTable = union_df.registerTempTable("tempRaw") > > create = hc.sql('CREATE TABLE IF NOT EXISTS blab.pyspark_dpprq

Re: Write only one output file in Spark SQL

2017-08-11 Thread KhajaAsmath Mohammed
tempTable = union_df.registerTempTable("tempRaw") create = hc.sql('CREATE TABLE IF NOT EXISTS blab.pyspark_dpprq (vin string, utctime timestamp, description string, descriptionuom string, providerdesc string, dt_map string, islocation string, latitude double, longitude double, speed double, value

Re: Structured Streaming + Kafka Integration unable to read new messages after some time

2017-08-11 Thread Jacek Laskowski
Hi, Any logs you could share? Anything about the query itself? Watermarked? Aggregation? How long does it work fine? Is this somehow stable in its instability? What version of Spark and Kafka? Regards, Jacek Laskowski http://blog.japila.pl On 11 Aug 2017 11:29, "NikhilP"

Re: Write only one output file in Spark SQL

2017-08-11 Thread Daniel van der Ende
Hi Asmath, Could you share the code you're running? Daniel On Fri, 11 Aug 2017, 17:53 KhajaAsmath Mohammed, wrote: > Hi, > > > > I am using spark sql to write data back to hdfs and it is resulting in > multiple output files. > > > > I tried changing number

Write only one output file in Spark SQL

2017-08-11 Thread KhajaAsmath Mohammed
Hi, I am using Spark SQL to write data back to HDFS and it is resulting in multiple output files. I tried setting spark.sql.shuffle.partitions=1 but it resulted in very slow performance. I also tried coalesce and repartition, still the same issue. Any suggestions? Thanks, Asmath
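
For reference, the pattern the replies converge on is to narrow the partition count right before the write instead of lowering spark.sql.shuffle.partitions globally, so upstream stages keep their parallelism. A minimal sketch, assuming a DataFrame df; the path is a placeholder:

    // Only the final write stage runs with one task; the rest of the
    // job keeps its normal parallelism.
    df.coalesce(1)
      .write
      .mode("overwrite")
      .format("parquet")
      .save("hdfs:///user/output/single-file/")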

[Spark Core] Is it possible to insert a function directly into the Logical Plan?

2017-08-11 Thread Lukas Bradley
We have had issues gathering status on long-running jobs. We have attempted to draw parallels between the Spark UI/Monitoring API and our code base. Due to the separation between the code and the execution plan, even guessing where we are in the process is difficult. The

Fwd: Issues when trying to recover a textFileStream from checkpoint in Spark streaming

2017-08-11 Thread swetha kasireddy
Hi, I am facing issues while trying to recover a textFileStream from a checkpoint. Basically it is trying to load the files from the beginning of the job, whereas I am deleting the files after processing them. I have the following configs set, so I was thinking that it should not look for files
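
For context, DStream checkpoint recovery requires the whole streaming graph to be built inside the factory passed to getOrCreate, or the restarted job replays stale state. A minimal sketch with placeholder paths and batch interval:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    def createContext(): StreamingContext = {
      val conf = new SparkConf().setAppName("file-stream")
      val ssc = new StreamingContext(conf, Seconds(30))
      ssc.checkpoint("hdfs:///checkpoints/file-stream")
      // All stream definitions must live inside this factory so they
      // can be re-instantiated from the checkpoint on restart.
      ssc.textFileStream("hdfs:///incoming/").foreachRDD { rdd =>
        println(s"batch count: ${rdd.count()}")
      }
      ssc
    }

    val ssc = StreamingContext.getOrCreate(
      "hdfs:///checkpoints/file-stream", createContext _)
    ssc.start()
    ssc.awaitTermination()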

Structured Streaming + Kafka Integration unable to read new messages after some time

2017-08-11 Thread NikhilP
I am working on structured streaming, reading data from Kafka and writing it back to Kafka. I am facing a weird issue: sometimes the job stops reading data from the Kafka topics, and after deleting the checkpoint directory it starts reading the Kafka topics again. Can anybody help me in resolving the

Re: XML Parsing with Spark and Scala

2017-08-11 Thread Jörn Franke
Can you specify what "is not able to load" means and what the expected results are? > On 11. Aug 2017, at 09:30, Etisha Jain wrote: > > Hi > > I want to do xml parsing with spark, but the data from the file is not able > to load and the desired output is also not

ThriftServer Start Error

2017-08-11 Thread Ascot Moss
Hi, I tried to start the Spark Thrift server but got the following error: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)] java.io.IOException: javax.security.sasl.SaslException: GSS initiate

Spark ThriftServer Error

2017-08-11 Thread Ascot Moss
Hi, When starting the Thrift server, I got the following issue: 17/08/11 16:06:56 ERROR util.Utils: Uncaught exception in thread Thread-3 java.lang.NullPointerException at org.apache.spark.sql.hive.thriftserver.HiveThriftServer2$$anonfun$main$1.apply$mcV$sp(HiveThriftServer2.scala:85) at

Re: Does Spark SQL use Calcite?

2017-08-11 Thread Ryan
the thrift server is a jdbc server, Kanth On Fri, Aug 11, 2017 at 2:51 PM, wrote: > I also wonder why there isn't a jdbc connector for spark sql? > > Sent from my iPhone > > On Aug 10, 2017, at 2:45 PM, Jules Damji wrote: > > Yes, it's more used in Hive

XML Parsing with Spark and Scala

2017-08-11 Thread Etisha Jain
Hi, I want to do XML parsing with Spark, but the data from the file is not able to load and the desired output is also not coming. I am attaching a file. Can anyone help me to do this? solvePuzzle1.scala --

Re: Does Spark SQL use Calcite?

2017-08-11 Thread kanth909
I also wonder why there isn't a JDBC connector for Spark SQL? Sent from my iPhone > On Aug 10, 2017, at 2:45 PM, Jules Damji wrote: > > Yes, it's more used in Hive than Spark > > Sent from my iPhone > Pardon the dumb thumb typos :) > >> On Aug 10, 2017, at 2:24 PM,

Spark Thriftserver ERROR

2017-08-11 Thread Ascot Moss
Hi, I tried to start the Spark Thrift server but got the following error: javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)] java.io.IOException: javax.security.sasl.SaslException: GSS initiate