Re: Multiple joins with the same DataFrame throws AnalysisException: resolved attribute(s)
Spark 2.3.0 has this problem; upgrade to 2.3.1.

> On Jul 19, 2018, at 2:13 PM, Nirav Patel wrote:
>
> Corrected subject line: it's a missing-attribute error, not an ambiguous-reference error.
>
>> On Thu, Jul 19, 2018 at 2:11 PM, Nirav Patel wrote:
>>
>> I am getting a missing-attribute error after joining dataframe 'df2' twice.
>>
>> Exception in thread "main" org.apache.spark.sql.AnalysisException: resolved attribute(s) fid#49 missing from value#14,value#126,mgrId#15,name#16,d31#109,df2Id#125,df2Id#47,d4#130,d3#129,df1Id#13,name#128,fId#127 in operator !Join LeftOuter, (mgrId#15 = fid#49);;
>> !Join LeftOuter, (mgrId#15 = fid#49)
>> :- Project [df1Id#13, value#14, mgrId#15, name#16, df2Id#47, d3#51 AS d31#109]
>> :  +- Join Inner, (df1Id#13 = fid#49)
>> :     :- Project [_1#6 AS df1Id#13, _2#7 AS value#14, _3#8 AS mgrId#15, _4#9 AS name#16, _5#10 AS d1#17, _6#11 AS d2#18]
>> :     :  +- LocalRelation [_1#6, _2#7, _3#8, _4#9, _5#10, _6#11]
>> :     +- Project [_1#40 AS df2Id#47, _2#41 AS value#48, _3#42 AS fId#49, _4#43 AS name#50, _5#44 AS d3#51, _6#45 AS d4#52]
>> :        +- LocalRelation [_1#40, _2#41, _3#42, _4#43, _5#44, _6#45]
>> +- Project [_1#40 AS df2Id#125, _2#41 AS value#126, _3#42 AS fId#127, _4#43 AS name#128, _5#44 AS d3#129, _6#45 AS d4#130]
>>    +- LocalRelation [_1#40, _2#41, _3#42, _4#43, _5#44, _6#45]
>>
>>   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:40)
>>   at ...
>>
>> As you can see, "fid" is present, but Spark is looking for fid#49 while there is another one, fid#127.
>>
>> The physical plan of the original df2 is:
>>
>> == Physical Plan ==
>> LocalTableScan [df2Id#47, value#48, fId#49, name#50, d3#51, d4#52]
>>
>> But looking at the plan above, it seems multiple versions of 'fid' get generated (fid#49, fid#127).
>>
>> Here's the full code:
>>
>> val seq1 = Seq(
>>   (1, "a",  1, "bla", "2014-01-01 00:00:00", "2014-01-01 00:00:00"),
>>   (2, "a",  0, "bla", "2014-01-01 00:00:00", "2014-09-12 18:55:43"),
>>   (3, "a",  2, "bla", "2000-12-01 00:00:00", "2000-01-01 00:00:00"),
>>   (4, "bb", 1, "bla", "2014-01-01 00:00:00", "2014-01-01 00:00:00"),
>>   (5, "bb", 2, "bla", "2014-01-01 00:00:00", "2014-01-01 00:00:00"),
>>   (6, "bb", 0, "bla", "2014-01-01 00:00:00", "2014-01-01 00:00:00"))
>> //val rdd1 = spark.sparkContext.parallelize(seq1)
>> val df1 = seq1.toDF("id", "value", "mgrId", "name", "d1", "d2")
>> df1.show()
>>
>> val seq2 = Seq(
>>   (1, "a1", 1, "duh", "2014-01-01 00:00:00", "2014-01-01 00:00:00"),
>>   (2, "a2", 1, "duh", "2014-01-01 00:00:00", "2014-09-12 18:55:43"),
>>   (3, "a3", 2, "jah", "2000-12-01 00:00:00", "2000-01-01 00:00:00"),
>>   (4, "a4", 3, "duh", "2014-01-01 00:00:00", "2014-01-01 00:00:00"),
>>   (5, "a5", 4, "jah", "2014-01-01 00:00:00", "2014-01-01 00:00:00"),
>>   (6, "a6", 5, "jah", "2014-01-01 00:00:00", "2014-01-01 00:00:00"))
>>
>> val df2 = seq2.toDF("id", "value", "fId", "name", "d1", "d2")
>> df2.explain()
>> df2.show()
>>
>> val join1 = df1
>>   .join(df2, df1("id") === df2("fid"))
>>   .select(df1("id"), df1("value"), df1("mgrId"), df1("name"),
>>     df2("id").as("df2id"), df2("fid"), df2("value"))
>> join1.printSchema()
>> join1.show()
>>
>> val join2 = join1
>>   .join(df2, join1("mgrId") === df2("fid"), "left")
>>   .select(join1("id"), join1("value"), join1("mgrId"), join1("name"),
>>     join1("df2id"), join1("fid"), df2("fid").as("df2fid"))
>> join2.printSchema()
>> join2.show()
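A follow-up not in the original thread: beyond upgrading, a workaround often suggested for this class of "resolved attribute(s) ... missing" error is to give each reference to df2 its own alias and resolve the join columns by qualified name, so the analyzer never has to choose between two copies of fid#NN. The sketch below is an untested illustration of that idea against the code above (the alias names and the df2value rename are my own); whether it sidesteps the 2.3.0 analyzer bug specifically is not verified here.

import org.apache.spark.sql.functions.col

// Each use of df2 gets a distinct alias; join conditions use qualified names
// instead of reusing df2("fid") from two different plan branches.
val a  = df1.alias("a")
val b1 = df2.alias("b1")
val b2 = df2.alias("b2")

val join1 = a
  .join(b1, col("a.id") === col("b1.fId"))
  .select(col("a.id"), col("a.value"), col("a.mgrId"), col("a.name"),
    col("b1.id").as("df2id"), col("b1.fId"), col("b1.value").as("df2value"))

val join2 = join1.alias("j1")
  .join(b2, col("j1.mgrId") === col("b2.fId"), "left")
  .select(col("j1.id"), col("j1.value"), col("j1.mgrId"), col("j1.name"),
    col("j1.df2id"), col("j1.fId"), col("b2.fId").as("df2fid"))

join2.show()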
How does the Databricks UI work?
Hi All,

I have seen talks where a Databricks host writes a SQL query in the front end and clicks run, and the results are pushed to the front end as a stream (the graphs update as new responses arrive). I can assume that pushing data to the front end is done via WebSockets, but I am more interested in how the queries are submitted to the backend in an ad-hoc manner.
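The thread did not get an answer on the list. Purely as a hedged sketch of one way a backend could accept ad-hoc queries (an assumption on my part, not how Databricks actually implements it): keep a single long-lived SparkSession behind the web layer, run each submitted SQL string through spark.sql(), and let the web layer (a WebSocket handler or similar) serialize and push the rows to the browser. The object and app names below are illustrative.

import org.apache.spark.sql.SparkSession

object AdHocQueryService {
  // One long-lived session shared by the web layer; creating a session per query would be slow.
  private lazy val spark = SparkSession.builder()
    .appName("adhoc-query-service")
    .getOrCreate()

  // Called by the web layer for each query the user runs.
  // Returns rows as JSON strings; a real UI would page or stream large results instead of collect().
  def run(sql: String): Seq[String] =
    spark.sql(sql).toJSON.collect().toSeq
}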
Re: [Spark] Can Apache Spark be used with time series processing?
I don't see why not.

> On Aug 24, 2017, at 1:52 PM, Alexandr Porunov wrote:
>
> Hello,
>
> I am new to Apache Spark. I need to process different time series data (numeric values that depend on time) and react to the following conditions:
> 1. The data changes up or down too fast.
> 2. The data keeps changing up or down for too long.
>
> For example, if the data has changed 30% up or down within the last five minutes (or less), I need to send a special event. If the data has changed 50% up or down within two hours (or less), I need to send a special event.
>
> The frequency of data changes is about 1000-3000 per second, and I need to react as soon as possible.
>
> Does Apache Spark fit this scenario well, or do I need to look for another solution? Sorry for the stupid question, but I am a total newbie.
>
> Regards
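Spark can express this kind of rule with Structured Streaming. The sketch below is only an illustration under assumptions not in the thread: events arrive on a Kafka topic named "rates" as JSON with "series" and "value" fields, and "changed 30% up or down in the last five minutes or less" is approximated by comparing the min and max inside a sliding five-minute event-time window. The two-hour/50% rule would be a second aggregation of the same shape.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("ts-alerts").getOrCreate()

// Parse the Kafka payload into (timestamp, series, value).
val ticks = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:9092")   // illustrative broker
  .option("subscribe", "rates")                       // illustrative topic
  .load()
  .select(
    col("timestamp"),
    get_json_object(col("value").cast("string"), "$.series").as("series"),
    get_json_object(col("value").cast("string"), "$.value").cast("double").as("value"))

// Flag any series whose value swung by 30% or more within a 5-minute window.
val alerts = ticks
  .withWatermark("timestamp", "10 minutes")
  .groupBy(col("series"), window(col("timestamp"), "5 minutes", "1 minute"))
  .agg(min("value").as("lo"), max("value").as("hi"))
  .where(col("hi") >= col("lo") * 1.3)

// Console sink for the sketch; a real deployment would write the alert event somewhere durable.
alerts.writeStream.outputMode("update").format("console").start().awaitTermination()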
Re: Does Spark SQL use Calcite?
I also wonder why there isn't a JDBC connector for Spark SQL.

> On Aug 10, 2017, at 2:45 PM, Jules Damji wrote:
>
> Yes, it's more used in Hive than Spark.
>
>> On Aug 10, 2017, at 2:24 PM, Sathish Kumaran Vairavelu wrote:
>>
>> I think it is for the Hive dependency.
>>
>>> On Thu, Aug 10, 2017 at 4:14 PM, kant kodali wrote:
>>>
>>> Since I see a Calcite dependency in Spark, I wonder where Calcite is being used?
>>>
>>>> On Thu, Aug 10, 2017 at 1:30 PM, Sathish Kumaran Vairavelu wrote:
>>>>
>>>> Spark SQL doesn't use Calcite.
>>>>
>>>>> On Thu, Aug 10, 2017 at 3:14 PM, kant kodali wrote:
>>>>>
>>>>> Hi All,
>>>>>
>>>>> Does Spark SQL use Calcite? If so, what for? I thought Spark SQL has Catalyst, which generates its own logical plans, physical plans, and other optimizations.
>>>>>
>>>>> Thanks,
>>>>> Kant
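An aside on the JDBC question, not from the original replies: Spark SQL does ship a JDBC/ODBC endpoint, the Thrift JDBC/ODBC server started with sbin/start-thriftserver.sh. It speaks the HiveServer2 protocol, so the ordinary Hive JDBC driver can query the tables it exposes. The sketch below assumes a Thrift server running locally on the default port 10000 and an existing table named some_table (both assumptions).

import java.sql.DriverManager

// The Hive JDBC driver talks to Spark's Thrift server over the HiveServer2 protocol.
Class.forName("org.apache.hive.jdbc.HiveDriver")
val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000", "spark", "")
val stmt = conn.createStatement()
val rs   = stmt.executeQuery("SELECT count(*) FROM some_table")
while (rs.next()) println(rs.getLong(1))
rs.close(); stmt.close(); conn.close()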
Re: Is there a Kafka sink for Spark Structured Streaming
Hi! Is this possible in Spark 2.1.1?

> On May 19, 2017, at 5:55 AM, Patrick McGloin wrote:
>
> # Write key-value data from a DataFrame to a Kafka topic specified in an option
> query = df \
>   .selectExpr("CAST(userId AS STRING) AS key", "to_json(struct(*)) AS value") \
>   .writeStream \
>   .format("kafka") \
>   .option("kafka.bootstrap.servers", "host1:port1,host2:port2") \
>   .option("topic", "topic1") \
>   .option("checkpointLocation", "/path/to/HDFS/dir") \
>   .start()
>
> Described here:
> https://databricks.com/blog/2017/04/26/processing-data-in-apache-kafka-with-structured-streaming-in-apache-spark-2-2.html
>
>> On 19 May 2017 at 10:45, wrote:
>>
>> Is there a Kafka sink for Spark Structured Streaming?
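On the 2.1.1 question: the built-in "kafka" output format shown above arrived, to my knowledge, after 2.1.x (the linked post is about 2.2). A hedged workaround sketch for 2.1.x is a custom ForeachWriter that owns its own KafkaProducer; the broker, topic, and the assumption that the streaming DataFrame has string "key" and "value" columns are all illustrative.

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.spark.sql.{ForeachWriter, Row}

class KafkaForeachWriter(brokers: String, topic: String) extends ForeachWriter[Row] {
  private var producer: KafkaProducer[String, String] = _

  // Called once per partition/epoch on the executor; build the producer here, not on the driver.
  override def open(partitionId: Long, version: Long): Boolean = {
    val props = new Properties()
    props.put("bootstrap.servers", brokers)
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    producer = new KafkaProducer[String, String](props)
    true
  }

  // Assumes the streaming DataFrame has string "key" and "value" columns.
  override def process(row: Row): Unit =
    producer.send(new ProducerRecord(topic, row.getAs[String]("key"), row.getAs[String]("value")))

  override def close(errorOrNull: Throwable): Unit =
    if (producer != null) producer.close()
}

// Usage sketch:
// df.writeStream.foreach(new KafkaForeachWriter("host1:9092", "topic1")).start()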
Is there a Kafka sink for Spark Structured Streaming
Is there a Kafka sink for Spark Structured Streaming?
Re: Jupyter spark Scala notebooks
Which of these notebooks can help populate real-time graphs through a WebSocket or some other push mechanism?

> On May 17, 2017, at 8:50 PM, Stephen Boesch wrote:
>
> Jupyter with Toree works well for my team. Jupyter is more refined than Zeppelin as far as notebook features and usability: shortcuts, editing, etc. The caveat is that it is better to run a separate server instance for python/pyspark vs. scala/spark.
>
> 2017-05-17 19:27 GMT-07:00 Richard Moorhead:
>>
>> Take a look at Apache Zeppelin; it has both Python and Scala interpreters.
>> https://zeppelin.apache.org/
>>
>> Richard Moorhead
>> Software Engineer
>> richard.moorh...@c2fo.com
>>
>> From: upendra 1991
>> Sent: Wednesday, May 17, 2017 9:22:14 PM
>> To: user@spark.apache.org
>> Subject: Jupyter spark Scala notebooks
>>
>> What's the best way to use Jupyter with Scala Spark? I tried Apache Toree and created a kernel but did not get it working. I believe there is a better way.
>>
>> Please suggest any best practices.
Re: spark intermediate data fills up the disk
Hi! Yes, these files are for shuffle blocks, but they need to be cleaned up as well, right? I had been running a streaming application for 2 days; on the third day my disk filled up with all the .index and .data files, and my assumption is that these files had been there since the start of my streaming application. I should have checked the timestamps before doing rm -rf. Please let me know if I am wrong.

> On Jan 26, 2017, at 4:24 PM, Takeshi Yamamuro wrote:
>
> Yeah, I think so, and they are the intermediate files for shuffling. Probably kant checked the configuration here (http://spark.apache.org/docs/latest/spark-standalone.html), though this is not related to the issue.
>
> // maropu
>
>> On Fri, Jan 27, 2017 at 7:46 AM, Jacek Laskowski wrote:
>>
>> Hi,
>>
>> The files are for shuffle blocks. Where did you find the docs about them?
>>
>> Jacek
>>
>> On 25 Jan 2017 8:41 p.m., "kant kodali" wrote:
>>
>> Oh sorry, it's actually in the documentation. I should just set spark.worker.cleanup.enabled = true.
>>
>>> On Wed, Jan 25, 2017 at 11:30 AM, kant kodali wrote:
>>>
>>> I have a bunch of .index and .data files like that filling up my disk. I am not sure what the fix is. I am running Spark 2.0.2 in standalone mode.
>>>
>>> Thanks!
>
> --
> Takeshi Yamamuro
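For reference (not part of the original exchange), the standalone-mode cleanup settings mentioned above live in the worker configuration; the values below are examples only. Note that, per the standalone docs, this periodic cleanup removes the directories of stopped applications, so it would not have shrunk the shuffle files of a streaming job that was still running.

# In conf/spark-env.sh on each standalone worker (example values):
SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true \
  -Dspark.worker.cleanup.interval=1800 \
  -Dspark.worker.cleanup.appDataTtl=604800"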