Re: Is RDD.persist honoured if multiple actions are executed in parallel
If you want to ensure the persisted RDD has been calculated first, just run foreach with a dummy function to force evaluation before starting the other actions.

--
Michael Mior
michael.m...@gmail.com

On Thu, 24 Sep 2020 at 00:38, Arya Ketan wrote:
>
> Thanks, we were able to validate the same behaviour.
>
> On Wed, 23 Sep 2020 at 18:05, Sean Owen wrote:
>>
>> It is, but it happens asynchronously. If you access the same block twice in quick succession, the cached block may not yet be available the second time.
>>
>> On Wed, Sep 23, 2020, 7:17 AM Arya Ketan wrote:
>>>
>>> Hi,
>>> I have a Spark Streaming use case (Spark 2.2.1), and my Spark job has multiple actions. I am running them in parallel by executing the actions in separate threads. I have an rdd.persist after which the DAG forks into multiple actions,
>>> but I see that RDD caching is not happening and the entire DAG is executed twice (once in each action).
>>>
>>> What am I missing?
>>> Arya
>
> --
> Arya
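For concreteness, a minimal sketch of this workaround in Scala (assuming a SparkContext sc as in spark-shell; the input path, the transformation, and the output path are all hypothetical):

import scala.concurrent.{ExecutionContext, Future}
import org.apache.spark.storage.StorageLevel

// Build the shared RDD and mark it for caching.
val shared = sc.textFile("file:///tmp/events.txt")
  .map(_.toUpperCase)
  .persist(StorageLevel.MEMORY_AND_DISK)

// Dummy action: forces the whole RDD to be computed and cached before any
// of the real actions run.
shared.foreach(_ => ())

implicit val ec: ExecutionContext = ExecutionContext.global

// The real actions, launched from separate threads, now read the cached
// blocks instead of re-running the full lineage.
val countF = Future { shared.count() }
val saveF  = Future { shared.saveAsTextFile("file:///tmp/events-upper") }

The dummy foreach costs one extra pass over the data, but it guarantees the concurrent actions find the blocks already cached instead of racing the asynchronous caching.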
Re: Why Apache Spark doesn't use Calcite?
It's fairly common for adapters (Calcite's abstraction of a data source) to push down predicates. However, the API certainly looks a lot different from Catalyst's.

--
Michael Mior
mm...@apache.org

On Mon, 13 Jan 2020 at 09:45, Jason Nerothin wrote:
>
> The implementation they chose supports push-down predicates, Datasets and other features that are not available in Calcite:
>
> https://databricks.com/glossary/catalyst-optimizer
>
> On Mon, Jan 13, 2020 at 8:24 AM newroyker wrote:
>>
>> Was there a qualitative or quantitative benchmark done before a design decision was made not to use Calcite?
>>
>> Are there limitations (for heuristic-based, cost-based, *-aware optimizers) in Calcite, or in frameworks built on top of Calcite, in the context of big data / TPC-H benchmarks?
>>
>> I was unable to dig up anything concrete from the user group / Jira. I'd appreciate it if any Catalyst veteran here could give me pointers. Trying to defend Spark/Catalyst.
>
> --
> Thanks,
> Jason
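For comparison, here is a rough sketch of the source-facing side of predicate push-down in Catalyst, using Spark's PrunedFilteredScan data source interface (the pre-DataSourceV2 API). This is not Calcite's adapter API, and the relation below is only a skeleton with a hypothetical schema:

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, Filter, PrunedFilteredScan}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Catalyst hands the relation the pruned column list and whatever filters it
// decided can be pushed down; the source applies what it can, and Spark
// re-applies any remaining filters on top.
class ExampleRelation(override val sqlContext: SQLContext)
    extends BaseRelation with PrunedFilteredScan {

  override def schema: StructType =
    StructType(Seq(StructField("name", StringType), StructField("city", StringType)))

  override def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row] = {
    // A real source would translate requiredColumns and filters into its own
    // query language here; this skeleton just returns no rows.
    sqlContext.sparkContext.emptyRDD[Row]
  }
}

Calcite adapters achieve a similar effect, but largely through planner rules over relational operators, which is part of why the two APIs look so different.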
Re: Using Spark with Local File System/NFS
If you put a * in the path, Spark will look for a file or directory named *. To read all the files in a directory, just remove the star.

--
Michael Mior
michael.m...@gmail.com

On Jun 22, 2017 17:21, "saatvikshah1994" wrote:
> Hi,
>
> I've downloaded and kept the same set of data files on all my cluster nodes, in the same absolute path - say /home/xyzuser/data/*. I am now trying to perform an operation (say open(filename).read()) on all these files in Spark, but by passing local file paths. I was under the assumption that as long as the worker can find the file path it will be able to execute it. However, my Spark tasks fail with the error (/home/xyzuser/data/* is not present) - and I'm sure it's present on all my worker nodes.
>
> If this experiment was successful I was planning to set up an NFS (actually more like a read-only cloud persistent disk connected to my cluster nodes in Dataproc) and use that instead.
>
> What exactly is going wrong here?
>
> Thanks
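A minimal sketch of reading every file under a node-local directory in Scala (the path comes from the thread and must exist at the same location on every worker, or on a shared NFS mount; sc is the SparkContext as in spark-shell):

// Point at the directory itself -- no trailing * -- and Spark reads all the
// files it contains.

// As (path, contents) pairs, one record per file:
val files = sc.wholeTextFiles("file:///home/xyzuser/data")
files.take(1).foreach { case (path, contents) =>
  println(s"$path -> ${contents.length} characters")
}

// Or line by line across all files in the directory:
val lines = sc.textFile("file:///home/xyzuser/data")
println(lines.count())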
Re: "Sharing" dataframes...
This is a puzzling suggestion to me. It's unclear what features the OP needs, so it's really hard to say whether Livy or job-server would be insufficient. It's true that neither is particularly mature, but they're much more mature than a homemade project which hasn't started yet. That said, I'm not very familiar with either project, so perhaps there are some big concerns I'm not aware of.

--
Michael Mior
mm...@apache.org

2017-06-21 3:19 GMT-04:00 Rick Moritz:
> Keeping it inside the same program/SparkContext is the most performant solution, since you can avoid serialization and deserialization. In-memory persistence between jobs involves a memory copy, uses a lot of RAM and invokes serialization and deserialization. Technologies that can help you do that easily are Ignite (as mentioned) but also Alluxio, Cassandra with in-memory tables and a memory-backed HDFS directory (see tiered storage).
> Although Livy and job-server provide the functionality of exposing a single SparkContext to multiple programs, I would recommend you build your own framework for integrating different jobs, since many features you may need aren't present yet, while others may cause issues due to lack of maturity. Artificially splitting jobs is in general a bad idea, since it breaks the DAG and thus prevents some potential push-down optimizations.
>
> On Tue, Jun 20, 2017 at 10:17 PM, Jean Georges Perrin wrote:
>
>> Thanks Vadim & Jörn... I will look into those.
>>
>> jg
>>
>> On Jun 20, 2017, at 2:12 PM, Vadim Semenov wrote:
>>
>> You can launch one permanent Spark context and then execute your jobs within that context. And since they'll be running in the same context, they can share data easily.
>>
>> These two projects provide the functionality that you need:
>> https://github.com/spark-jobserver/spark-jobserver#persistent-context-mode---faster--required-for-related-jobs
>> https://github.com/cloudera/livy#post-sessions
>>
>> On Tue, Jun 20, 2017 at 1:46 PM, Jean Georges Perrin wrote:
>>
>>> Hey,
>>>
>>> Here is my need: program A does something on a set of data and produces results, program B does that on another set, and finally, program C combines the data of A and B. Of course, the easy way is to dump all on disk after A and B are done, but I wanted to avoid this.
>>>
>>> I was thinking of creating a temp view, but I do not really like the temp aspect of it ;). Any idea (they are all worth sharing)
>>>
>>> jg
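For what it's worth, a minimal sketch of the single-SparkContext approach in Scala (one application standing in for "programs" A, B and C; all paths and column names are hypothetical):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("shared-dataframes").getOrCreate()

// "Program A": produce a result and cache it so later stages reuse it
// without recomputation.
val resultA = spark.read.parquet("/data/setA").filter("score > 0.5").cache()

// "Program B": the same idea on another data set.
val resultB = spark.read.parquet("/data/setB").filter("score > 0.5").cache()

// "Program C": combine A and B without writing intermediate results to disk.
val combined = resultA.join(resultB, Seq("key"))
combined.write.parquet("/data/combined")

// The temp-view route works too, if C is easier to express in SQL:
resultA.createOrReplaceTempView("a")
resultB.createOrReplaceTempView("b")
val combinedSql = spark.sql("SELECT * FROM a JOIN b ON a.key = b.key")

Everything stays inside one application and one SparkContext, so nothing has to be serialized to external storage between the A, B and C stages.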
Re: Do we anything for Deep Learning in Spark?
It's still in the early stages, but check out Deep Learning Pipelines from Databricks: https://github.com/databricks/spark-deep-learning

--
Michael Mior
mm...@apache.org

2017-06-20 0:36 GMT-04:00 Gaurav1809:
> Hi All,
>
> Similar to how we have a machine learning library called ML, do we have anything for deep learning?
> If yes, please share the details. If not, then what should be the approach?
>
> Thanks and regards,
> Gaurav Pandya
Re: [SparkSQL] Escaping a query for a dataframe query
Assuming the parameter to your UDF should be start"end (with a quote in the middle), you need to insert a backslash into the query, and that backslash must itself be escaped in your Scala code. So just add two extra backslashes before the quote inside the string:

sqlContext.sql("SELECT * FROM mytable WHERE (mycolumn BETWEEN 1 AND 2) AND (myudfsearchfor(\"start\\\"end\"))")

--
Michael Mior
mm...@apache.org

2017-06-15 12:05 GMT-04:00 mark.jenki...@baesystems.com:
> Hi,
>
> I have a query sqlContext.sql(“SELECT * FROM mytable WHERE (mycolumn BETWEEN 1 AND 2) AND (myudfsearchfor(\“start\"end\”))”
>
> How should I escape the double quote so that it successfully parses?
>
> I know I can use single quotes, but I do not want to, since I may need to search for both a single and a double quote.
>
> The exception I get is:
>
> [Thread-18] ERROR QueryService$ - Failed to complete query, will mark job as failed
> java.lang.RuntimeException: [1.117] failure: ``)'' expected but "end" found
>
> SELECT * FROM mytable WHERE (mycolumn BETWEEN 1 AND 2) AND (myudfsearchfor(\“start\"end\”))
>                                                                                    ^
>         at scala.sys.package$.error(package.scala:27)
>         at org.apache.spark.sql.catalyst.AbstractSparkSQLParser.parse(AbstractSparkSQLParser.scala:36)
>         at org.apache.spark.sql.catalyst.DefaultParserDialect.parse(ParserDialect.scala:67)
>         at org.apache.spark.sql.SQLContext$$anonfun$2.apply(SQLContext.scala:211)
>         at org.apache.spark.sql.SQLContext$$anonfun$2.apply(SQLContext.scala:211)
>
> Thank you
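A short end-to-end sketch of the escaping in Scala (sqlContext as in the thread; the UDF registration and the table setup are hypothetical stand-ins so the example is self-contained):

// Hypothetical one-argument UDF and table, only so the query below has
// something to call.
sqlContext.udf.register("myudfsearchfor", (needle: String) => needle.nonEmpty)
sqlContext.read.json("/tmp/mytable.json").registerTempTable("mytable")

// The SQL text the parser must receive is:  myudfsearchfor("start\"end")
// In a Scala double-quoted literal both the quote and the backslash need
// escaping, hence \" for the outer quotes and \\\" for the embedded one.
sqlContext.sql(
  "SELECT * FROM mytable WHERE (mycolumn BETWEEN 1 AND 2) AND (myudfsearchfor(\"start\\\"end\"))")

// A triple-quoted string avoids escaping the outer quotes and can be easier
// to read; the resulting SQL text is identical.
sqlContext.sql(
  """SELECT * FROM mytable WHERE (mycolumn BETWEEN 1 AND 2) AND (myudfsearchfor("start\"end"))""")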
Re: Number Of Partitions in RDD
While I'm not sure why you're seeing an increase in partitions with such a small data file, it's worth noting that the second parameter to textFile is the *minimum* number of partitions, so there's no guarantee you'll get exactly that number.

--
Michael Mior
mm...@apache.org

2017-06-01 6:28 GMT-04:00 Vikash Pareek:
> Hi,
>
> I am creating an RDD from a text file by specifying the number of partitions, but it gives me a different number of partitions than the one specified.
>
> scala> val people = sc.textFile("file:///home/pvikash/data/test.txt", 0)
> people: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[72] at textFile at :27
> scala> people.getNumPartitions
> res47: Int = 1
>
> scala> val people = sc.textFile("file:///home/pvikash/data/test.txt", 1)
> people: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[50] at textFile at :27
> scala> people.getNumPartitions
> res36: Int = 1
>
> scala> val people = sc.textFile("file:///home/pvikash/data/test.txt", 2)
> people: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[52] at textFile at :27
> scala> people.getNumPartitions
> res37: Int = 2
>
> scala> val people = sc.textFile("file:///home/pvikash/data/test.txt", 3)
> people: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[54] at textFile at :27
> scala> people.getNumPartitions
> res38: Int = 3
>
> scala> val people = sc.textFile("file:///home/pvikash/data/test.txt", 4)
> people: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[56] at textFile at :27
> scala> people.getNumPartitions
> res39: Int = 4
>
> scala> val people = sc.textFile("file:///home/pvikash/data/test.txt", 5)
> people: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[58] at textFile at :27
> scala> people.getNumPartitions
> res40: Int = 6
>
> scala> val people = sc.textFile("file:///home/pvikash/data/test.txt", 6)
> people: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[60] at textFile at :27
> scala> people.getNumPartitions
> res41: Int = 7
>
> scala> val people = sc.textFile("file:///home/pvikash/data/test.txt", 7)
> people: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[62] at textFile at :27
> scala> people.getNumPartitions
> res42: Int = 8
>
> scala> val people = sc.textFile("file:///home/pvikash/data/test.txt", 8)
> people: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[64] at textFile at :27
> scala> people.getNumPartitions
> res43: Int = 9
>
> scala> val people = sc.textFile("file:///home/pvikash/data/test.txt", 9)
> people: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[66] at textFile at :27
> scala> people.getNumPartitions
> res44: Int = 11
>
> scala> val people = sc.textFile("file:///home/pvikash/data/test.txt", 10)
> people: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[68] at textFile at :27
> scala> people.getNumPartitions
> res45: Int = 11
>
> scala> val people = sc.textFile("file:///home/pvikash/data/test.txt", 11)
> people: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[70] at textFile at :27
> scala> people.getNumPartitions
> res46: Int = 13
>
> Contents of the file /home/pvikash/data/test.txt are:
> "
> This is a test file.
> Will be used for rdd partition
> "
>
> I am trying to understand why the number of partitions is changing here, and, in the case of small data (which can fit into one partition), why Spark creates empty partitions.
>
> Any explanation would be appreciated.
>
> --Vikash
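A small illustrative sketch of the minPartitions behaviour and of how to get an exact partition count when one is needed (run in spark-shell, so sc is the SparkContext; the path comes from the thread, and the reported numbers depend on the file size and the Hadoop input-split calculation):

// The second argument is only a lower bound on the number of partitions.
val people = sc.textFile("file:///home/pvikash/data/test.txt", 10)
println(people.getNumPartitions)   // at least 10, but possibly more (e.g. 11)

// If an exact number of partitions is required, repartition explicitly.
val exact = people.repartition(10)
println(exact.getNumPartitions)    // exactly 10

// coalesce(n) reduces the partition count without a full shuffle; it cannot
// increase it unless shuffle = true is passed.
val fewer = exact.coalesce(2)
println(fewer.getNumPartitions)    // 2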