Re: Multiple joins with same Dataframe throws AnalysisException: resolved attribute(s)

2018-07-19 Thread kanth909
Spark 2.3.0 has this problem; upgrade to 2.3.1.
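
If you can't upgrade yet, a common workaround (just a sketch against the df1/join1/df2 in your code below; the "j1" and "d2b" alias names are mine, and I haven't run it against this exact plan) is to give df2 a fresh alias for the second join and refer to its columns through that alias instead of reusing the original df2("fid") reference (fid#49):

import org.apache.spark.sql.functions.col

// Alias both sides, then resolve columns by alias-qualified name.
val join2 = join1.alias("j1")
  .join(df2.alias("d2b"), col("j1.mgrId") === col("d2b.fId"), "left")
  .select(col("j1.id"), col("j1.mgrId"), col("j1.name"), col("j1.df2id"),
    col("j1.fid"), col("d2b.fId").as("df2fid"))

Resolving by alias-qualified name lets the analyzer bind "d2b.fId" to the deduplicated copy of df2 on the right side of the join, rather than to the stale attribute id carried by df2("fid").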

Sent from my iPhone

> On Jul 19, 2018, at 2:13 PM, Nirav Patel  wrote:
> 
> Corrected subject line. It's a missing attribute error, not an ambiguous reference error.
> 
>> On Thu, Jul 19, 2018 at 2:11 PM, Nirav Patel  wrote:
>> I am getting a missing attribute error after joining dataframe 'df2' twice.
>> 
>> Exception in thread "main" org.apache.spark.sql.AnalysisException: resolved attribute(s) fid#49 missing from value#14,value#126,mgrId#15,name#16,d31#109,df2Id#125,df2Id#47,d4#130,d3#129,df1Id#13,name#128,fId#127 in operator !Join LeftOuter, (mgrId#15 = fid#49);;
>> !Join LeftOuter, (mgrId#15 = fid#49)
>> :- Project [df1Id#13, value#14, mgrId#15, name#16, df2Id#47, d3#51 AS d31#109]
>> :  +- Join Inner, (df1Id#13 = fid#49)
>> :     :- Project [_1#6 AS df1Id#13, _2#7 AS value#14, _3#8 AS mgrId#15, _4#9 AS name#16, _5#10 AS d1#17, _6#11 AS d2#18]
>> :     :  +- LocalRelation [_1#6, _2#7, _3#8, _4#9, _5#10, _6#11]
>> :     +- Project [_1#40 AS df2Id#47, _2#41 AS value#48, _3#42 AS fId#49, _4#43 AS name#50, _5#44 AS d3#51, _6#45 AS d4#52]
>> :        +- LocalRelation [_1#40, _2#41, _3#42, _4#43, _5#44, _6#45]
>> +- Project [_1#40 AS df2Id#125, _2#41 AS value#126, _3#42 AS fId#127, _4#43 AS name#128, _5#44 AS d3#129, _6#45 AS d4#130]
>>    +- LocalRelation [_1#40, _2#41, _3#42, _4#43, _5#44, _6#45]
>> 
>>     at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:40)
>>     at 
>> 
>> As you can see, "fId" is present, but Spark is looking for fid#49 while the plan also contains fId#127.
>> 
>> The physical plan of the original df2 is:
>> == Physical Plan ==
>> LocalTableScan [df2Id#47, value#48, fId#49, name#50, d3#51, d4#52]
>> 
>> But looking at the plan, it seems multiple versions of 'fId' get generated (fId#49, fId#127).
>> 
>> Here's the full code.
>> 
>> 
>> Code:
>> 
>> import spark.implicits._   // needed for the toDF calls below
>> 
>> val seq1 = Seq(
>>   (1, "a",  1, "bla", "2014-01-01 00:00:00", "2014-01-01 00:00:00"),
>>   (2, "a",  0, "bla", "2014-01-01 00:00:00", "2014-09-12 18:55:43"),
>>   (3, "a",  2, "bla", "2000-12-01 00:00:00", "2000-01-01 00:00:00"),
>>   (4, "bb", 1, "bla", "2014-01-01 00:00:00", "2014-01-01 00:00:00"),
>>   (5, "bb", 2, "bla", "2014-01-01 00:00:00", "2014-01-01 00:00:00"),
>>   (6, "bb", 0, "bla", "2014-01-01 00:00:00", "2014-01-01 00:00:00"))
>> // val rdd1 = spark.sparkContext.parallelize(seq1)
>> val df1 = seq1.toDF("id", "value", "mgrId", "name", "d1", "d2")
>> df1.show()
>> 
>> val seq2 = Seq(
>>   (1, "a1", 1, "duh", "2014-01-01 00:00:00", "2014-01-01 00:00:00"),
>>   (2, "a2", 1, "duh", "2014-01-01 00:00:00", "2014-09-12 18:55:43"),
>>   (3, "a3", 2, "jah", "2000-12-01 00:00:00", "2000-01-01 00:00:00"),
>>   (4, "a4", 3, "duh", "2014-01-01 00:00:00", "2014-01-01 00:00:00"),
>>   (5, "a5", 4, "jah", "2014-01-01 00:00:00", "2014-01-01 00:00:00"),
>>   (6, "a6", 5, "jah", "2014-01-01 00:00:00", "2014-01-01 00:00:00"))
>> 
>> val df2 = seq2.toDF("id", "value", "fId", "name", "d1", "d2")
>> df2.explain()
>> df2.show()
>> 
>> // First join: inner join df1 to df2 on df1.id = df2.fId
>> val join1 = df1
>>   .join(df2, df1("id") === df2("fid"))
>>   .select(df1("id"), df1("value"), df1("mgrId"), df1("name"),
>>     df2("id").as("df2id"), df2("fid"), df2("value"))
>> join1.printSchema()
>> join1.show()
>> 
>> // Second join: left join the result back to df2 on mgrId = fId
>> // (this is the join that throws the "resolved attribute(s) missing" error)
>> val join2 = join1
>>   .join(df2, join1("mgrId") === df2("fid"), "left")
>>   .select(join1("id"), join1("value"), join1("mgrId"), join1("name"), join1("df2id"),
>>     join1("fid"), df2("fid").as("df2fid"))
>> join2.printSchema()
>> join2.show()


How does the Databricks UI work?

2017-09-02 Thread kanth909
Hi All, 

I have seen some talks where a Databricks host writes a SQL query in the front 
end and clicks run, and the results are then pushed to the front end as a stream 
(the graphs update as new results arrive). I can assume the part that pushes data 
to the front end is done via WebSockets, but I am more interested in how the 
queries are submitted to the backend in an ad-hoc manner.

Sent from my iPhone



Re: [Spark] Can Apache Spark be used with time series processing?

2017-08-30 Thread kanth909
I don't see why not.
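
For the 30%-in-five-minutes rule, here is a rough Structured Streaming sketch. Everything in it is illustrative: the rate source stands in for your real feed, and the max/min spread per window is only an approximation of "changed 30% up or down".

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder.appName("swing-detector").getOrCreate()

// Placeholder input: the built-in rate source yields (timestamp, value);
// in practice this would be your own stream parsed into the same shape.
val ticks = spark.readStream
  .format("rate")
  .option("rowsPerSecond", 1000)
  .load()
  .withColumn("value", rand() * 100)

// Sliding 5-minute windows evaluated every 10 seconds; flag windows where
// the spread between max and min exceeds 30% of the min.
val alerts = ticks
  .withWatermark("timestamp", "10 minutes")
  .groupBy(window(col("timestamp"), "5 minutes", "10 seconds"))
  .agg(min(col("value")).as("lo"), max(col("value")).as("hi"))
  .where((col("hi") - col("lo")) / col("lo") > 0.30)

// Replace the console sink with whatever emits your "special event".
alerts.writeStream
  .outputMode("update")
  .format("console")
  .start()

The main thing to verify is whether micro-batch latency is acceptable for "react as soon as possible" at 1,000-3,000 updates per second.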

Sent from my iPhone

> On Aug 24, 2017, at 1:52 PM, Alexandr Porunov  
> wrote:
> 
> Hello,
> 
> I am new to Apache Spark. I need to process different time series data 
> (numeric values that depend on time) and react to the following conditions:
> 1. The data changes up or down too fast.
> 2. The data keeps changing up or down for too long.
> 
> For example, if the data has changed 30% up or down in the last five minutes 
> (or less), I need to send a special event.
> If the data has changed 50% up or down within two hours (or less), I need to 
> send a special event.
> 
> The data changes at a rate of about 1,000-3,000 updates per second, and I need 
> to react as soon as possible.
> 
> Does Apache Spark fit this scenario well, or do I need to look for another 
> solution?
> Sorry for the stupid question, but I am a total newbie.
> 
> Regards




Re: Does Spark SQL use Calcite?

2017-08-11 Thread kanth909
I also wonder why there isn't a JDBC connector for Spark SQL?

Sent from my iPhone

> On Aug 10, 2017, at 2:45 PM, Jules Damji  wrote:
> 
> Yes, it's more used in Hive than Spark 
> 
> Sent from my iPhone
> Pardon the dumb thumb typos :)
> 
>> On Aug 10, 2017, at 2:24 PM, Sathish Kumaran Vairavelu 
>>  wrote:
>> 
>> I think it comes in as a Hive dependency.
>>> On Thu, Aug 10, 2017 at 4:14 PM kant kodali  wrote:
>>> Since I see a Calcite dependency in Spark, I wonder where Calcite is being 
>>> used.
>>> 
 On Thu, Aug 10, 2017 at 1:30 PM, Sathish Kumaran Vairavelu 
  wrote:
 Spark SQL doesn't use Calcite
 
> On Thu, Aug 10, 2017 at 3:14 PM kant kodali  wrote:
> Hi All, 
> 
> Does Spark SQL use Calcite? If so, what for? I thought Spark SQL has 
> Catalyst, which generates its own logical plans, physical plans, and 
> other optimizations. 
> 
> Thanks,
> Kant
>>> 


Re: Is there a Kafka sink for Spark Structured Streaming

2017-05-19 Thread kanth909
Hi!

Is this possible in Spark 2.1.1?

Sent from my iPhone

> On May 19, 2017, at 5:55 AM, Patrick McGloin  
> wrote:
> 
> # Write key-value data from a DataFrame to a Kafka topic specified in an option
> query = df \
>   .selectExpr("CAST(userId AS STRING) AS key", "to_json(struct(*)) AS value") \
>   .writeStream \
>   .format("kafka") \
>   .option("kafka.bootstrap.servers", "host1:port1,host2:port2") \
>   .option("topic", "topic1") \
>   .option("checkpointLocation", "/path/to/HDFS/dir") \
>   .start()
> Described here:
> https://databricks.com/blog/2017/04/26/processing-data-in-apache-kafka-with-structured-streaming-in-apache-spark-2-2.html
> 
> 
>> On 19 May 2017 at 10:45,  wrote:
>> Is there a Kafka sink for Spark Structured Streaming?
>> 
>> Sent from my iPhone
> 


Is there a Kafka sink for Spark Structured Streaming

2017-05-19 Thread kanth909
Is there a Kafka sink for Spark Structured Streaming?

Sent from my iPhone

Re: Jupyter spark Scala notebooks

2017-05-17 Thread kanth909
Which of these notebooks can help populate real-time graphs through WebSockets 
or some other push mechanism?

Sent from my iPhone

> On May 17, 2017, at 8:50 PM, Stephen Boesch  wrote:
> 
> Jupyter with Toree works well for my team. Jupyter is much more refined than 
> Zeppelin as far as notebook features and usability go: shortcuts, editing, etc. 
> The caveat is that it is better to run a separate server instance for 
> Python/PySpark vs. Scala/Spark.
> 
> 2017-05-17 19:27 GMT-07:00 Richard Moorhead :
>> Take a look at Apache Zeppelin; it has both Python and Scala interpreters.
>> https://zeppelin.apache.org/
>> 
>> 
>> . . . . . . . . . . . . . . . . . . . . . . . . . . .
>> 
>> Richard Moorhead
>> Software Engineer
>> richard.moorh...@c2fo.com
>> C2FO: The World's Market for Working Capital®
>>  
>> 
>> 
>> From: upendra 1991 
>> Sent: Wednesday, May 17, 2017 9:22:14 PM
>> To: user@spark.apache.org
>> Subject: Jupyter spark Scala notebooks
>>  
>> What's the best way to use Jupyter with Scala Spark? I tried Apache Toree 
>> and created a kernel but did not get it working. I believe there is a better 
>> way.
>> 
>> Please suggest any best practices.
>> 
>> Sent from Yahoo Mail on Android
> 


Re: Spark intermediate data fills up the disk

2017-01-26 Thread kanth909
Hi!

Yes, these files are for shuffle blocks, but they need to be cleaned up as well, 
right? I had been running a streaming application for two days; on the third day 
my disk filled up with .index and .data files. My assumption is that these files 
had been there since the start of my streaming application, though I should have 
checked the timestamps before doing rm -rf. Please let me know if I am wrong.

Sent from my iPhone

> On Jan 26, 2017, at 4:24 PM, Takeshi Yamamuro  wrote:
> 
> Yea, I think so; they are the intermediate files for shuffling. Kant probably 
> checked the configuration here 
> (http://spark.apache.org/docs/latest/spark-standalone.html), though that is 
> not related to this issue.
> 
> // maropu
> 
>> On Fri, Jan 27, 2017 at 7:46 AM, Jacek Laskowski  wrote:
>> Hi, 
>> 
>> The files are for shuffle blocks. Where did you find the docs about them? 
>> 
>> Jacek 
>> 
>> On 25 Jan 2017 8:41 p.m., "kant kodali"  wrote:
>> Oh sorry, it's actually in the documentation. I should just set 
>> spark.worker.cleanup.enabled = true.
>> 
>> On Wed, Jan 25, 2017 at 11:30 AM, kant kodali  wrote:
>>> I have a bunch of .index and .data files like that filling up my disk. I am 
>>> not sure what the fix is. I am running Spark 2.0.2 in standalone mode.
>>> 
>>> Thanks!
>>> 
>>> 
>>> 
>>> 
>>> 
>> 
>> 
> 
> 
> 
> -- 
> ---
> Takeshi Yamamuro