Re: Is RDD.persist honoured if multiple actions are executed in parallel

2020-09-24 Thread Michael Mior
If you want to ensure the persisted RDD has been calculated first,
just run foreach with a dummy function first to force evaluation.
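
For example, a minimal sketch in the shell (the parallelize call just stands in
for whatever builds the RDD at the fork point of your DAG):

val rdd = sc.parallelize(1 to 100)
val cached = rdd.persist()      // MEMORY_ONLY by default
cached.foreach(_ => ())         // dummy action: computes and caches every partition up front
// actions run afterwards, even from separate threads, read the cached blocks
cached.count()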

--
Michael Mior
michael.m...@gmail.com

On Thu, Sep 24, 2020 at 00:38, Arya Ketan wrote:
>
> Thanks, we were able to validate the same behaviour.
>
> On Wed, 23 Sep 2020 at 18:05, Sean Owen  wrote:
>>
>> It is, but the caching happens asynchronously. If you access the same block
>> twice in quick succession, the cached copy may not yet be available the second time.
>>
>> On Wed, Sep 23, 2020, 7:17 AM Arya Ketan  wrote:
>>>
>>> Hi,
>>> I have a Spark Streaming use case (Spark 2.2.1), and my Spark job has
>>> multiple actions, which I run in parallel by executing them in separate
>>> threads. I have an rdd.persist after which the DAG forks into multiple
>>> actions, but I see that RDD caching is not happening and the entire DAG is
>>> executed twice (once in each action).
>>>
>>> What am I missing?
>>> Arya
>>>
>>>
>>
>>
> --
> Arya

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Why Apache Spark doesn't use Calcite?

2020-01-13 Thread Michael Mior
It's fairly common for adapters (Calcite's abstraction of a data
source) to push down predicates. However, the API certainly looks a
lot different than Catalyst's.
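
For comparison, here is a rough sketch of what pushdown looks like on the Spark
side with the (v1) Data Source API; MyRelation is a hypothetical relation and
the ??? bodies are placeholders, not a working source:

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, Filter, PrunedFilteredScan}
import org.apache.spark.sql.types.StructType

class MyRelation(override val sqlContext: SQLContext) extends BaseRelation with PrunedFilteredScan {
  override def schema: StructType = ???  // the external source's schema
  // Catalyst hands over the required columns and the filters it chose to push down
  override def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row] = ???
}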
--
Michael Mior
mm...@apache.org

On Mon, Jan 13, 2020 at 09:45, Jason Nerothin wrote:
>
> The implementation they chose supports predicate pushdown, Datasets, and
> other features that are not available in Calcite:
>
> https://databricks.com/glossary/catalyst-optimizer
>
> On Mon, Jan 13, 2020 at 8:24 AM newroyker  wrote:
>>
>> Was there a qualitative or quantitative benchmark done before a design
>> decision was made not to use Calcite?
>>
>> Are there limitations (for heuristic-based, cost-based, or *-aware optimizers)
>> in Calcite, or in frameworks built on top of Calcite, in the context of big
>> data / TPC-H benchmarks?
>>
>> I was unable to dig up anything concrete from the user group or Jira. I would
>> appreciate it if any Catalyst veteran here could give me pointers. I am trying
>> to defend Spark/Catalyst.
>>
>>
>>
>>
>>
>> --
>> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>
>
> --
> Thanks,
> Jason

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Using Spark with Local File System/NFS

2017-06-22 Thread Michael Mior
If you put a * in the path, Spark will look for a file or directory named
*. To read all the files in a directory, just remove the star.
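
If the directory really is present at the same absolute path on every node,
Spark's built-in readers can also do the reading for you. A minimal sketch,
using the path from your message:

val lines = sc.textFile("file:///home/xyzuser/data")          // one record per line, across all files in the directory
val byFile = sc.wholeTextFiles("file:///home/xyzuser/data")   // RDD of (file path, file content) pairs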

--
Michael Mior
michael.m...@gmail.com

On Jun 22, 2017 17:21, "saatvikshah1994"  wrote:

> Hi,
>
> I've downloaded and kept the same set of data files on all my cluster nodes,
> in the same absolute path - say /home/xyzuser/data/*. I am now trying to
> perform an operation (say open(filename).read()) on all these files in Spark,
> but by passing local file paths. I was under the assumption that as long as
> the worker can find the file path, it will be able to execute it. However, my
> Spark tasks fail with the error (/home/xyzuser/data/* is not present), and
> I'm sure it's present on all my worker nodes.
>
> If this experiment was successful, I was planning to set up an NFS (actually
> more like a read-only cloud persistent disk connected to my cluster nodes in
> Dataproc) and use that instead.
>
> What exactly is going wrong here?
>
> Thanks
>
>
>
> --
> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Using-Spark-with-Local-File-System-NFS-tp28781.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: "Sharing" dataframes...

2017-06-21 Thread Michael Mior
This is a puzzling suggestion to me. It's unclear what features the OP
needs, so it's really hard to say whether Livy or job-server would be
insufficient. It's true that neither is particularly mature, but both are
much more mature than a homemade project that hasn't started yet.

That said, I'm not very familiar with either project, so perhaps there are
some big concerns I'm not aware of.

--
Michael Mior
mm...@apache.org

2017-06-21 3:19 GMT-04:00 Rick Moritz :

> Keeping it inside the same program/SparkContext is the most performant
> solution, since you can avoid serialization and deserialization.
> In-memory persistence between jobs involves a memory copy, uses a lot of RAM,
> and invokes serialization and deserialization. Technologies that can help
> you do that easily are Ignite (as mentioned), but also Alluxio, Cassandra
> with in-memory tables, and a memory-backed HDFS directory (see tiered
> storage).
> Although Livy and job-server can provide a
> single SparkContext to multiple programs, I would recommend you build your
> own framework for integrating different jobs, since many features you may
> need aren't present yet, while others may cause issues due to lack of
> maturity. Artificially splitting jobs is in general a bad idea, since it
> breaks the DAG and thus prevents some potential push-down optimizations.
>
> On Tue, Jun 20, 2017 at 10:17 PM, Jean Georges Perrin  wrote:
>
>> Thanks Vadim & Jörn... I will look into those.
>>
>> jg
>>
>> On Jun 20, 2017, at 2:12 PM, Vadim Semenov 
>> wrote:
>>
>> You can launch one permanent spark context and then execute your jobs
>> within the context. And since they'll be running in the same context, they
>> can share data easily.
>>
>> These two projects provide the functionality that you need:
>> https://github.com/spark-jobserver/spark-jobserver#persisten
>> t-context-mode---faster--required-for-related-jobs
>> https://github.com/cloudera/livy#post-sessions
>>
>> On Tue, Jun 20, 2017 at 1:46 PM, Jean Georges Perrin  wrote:
>>
>>> Hey,
>>>
>>> Here is my need: program A does something on a set of data and produces
>>> results, program B does that on another set, and finally, program C
>>> combines the data of A and B. Of course, the easy way is to dump all on
>>> disk after A and B are done, but I wanted to avoid this.
>>>
>>> I was thinking of creating a temp view, but I do not really like the
>>> temp aspect of it ;). Any ideas? (They are all worth sharing.)
>>>
>>> jg
>>>
>>>
>>>
>>> -
>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>>
>>>
>>
>>
>
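
For what it's worth, a minimal sketch of the shared-context approach Vadim
describes above, assuming a Spark 2.x SparkSession named spark; the paths and
view names are purely illustrative:

// "program A": compute a result and publish it inside the shared context
val dfA = spark.read.parquet("/data/input_a")
dfA.cache()
dfA.createOrReplaceTempView("results_a")

// "program B": same idea on another data set
val dfB = spark.read.parquet("/data/input_b")
dfB.cache()
dfB.createOrReplaceTempView("results_b")

// "program C": submitted later to the same context, combines A and B from the cached views
val combined = spark.sql("SELECT * FROM results_a UNION ALL SELECT * FROM results_b")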


Re: Do we anything for Deep Learning in Spark?

2017-06-20 Thread Michael Mior
It's still in the early stages, but check out Deep Learning Pipelines from
Databricks

https://github.com/databricks/spark-deep-learning

--
Michael Mior
mm...@apache.org

2017-06-20 0:36 GMT-04:00 Gaurav1809 :

> Hi All,
>
> Similar to how we have a machine learning library called ML, do we have
> anything for deep learning?
> If yes, please share the details. If not, then what should be the approach?
>
> Thanks and regards,
> Gaurav Pandya
>
>
>
> --
> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Do-we-anything-for-Deep-Learning-in-Spark-tp28772.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: [SparkSQL] Escaping a query for a dataframe query

2017-06-15 Thread Michael Mior
Assuming the parameter to your UDF should be start"end (with a quote in the
middle), you need to insert a backslash into the query, and that backslash must
itself be escaped in your code. So just add two extra backslashes before the
quote inside the string:

sqlContext.sql("SELECT * FROM mytable WHERE (mycolumn BETWEEN 1 AND 2) AND (myudfsearchfor(\"start\\\"end\"))")

--
Michael Mior
mm...@apache.org

2017-06-15 12:05 GMT-04:00 mark.jenki...@baesystems.com <
mark.jenki...@baesystems.com>:

> Hi,
>
> I have a query:
>
> sqlContext.sql("SELECT * FROM mytable WHERE (mycolumn BETWEEN 1 AND 2) AND (myudfsearchfor(\"start\"end\"))")
>
> How should I escape the double quote so that it successfully parses?
>
> I know I can use single quotes, but I do not want to, since I may need to
> search for both a single and a double quote.
>
> The exception I get is:
>
> [Thread-18] ERROR QueryService$ - Failed to complete query, will mark job as failed
> java.lang.RuntimeException: [1.117] failure: ``)'' expected but "end" found
>
> SELECT * FROM mytable WHERE (mycolumn BETWEEN 1 AND 2) AND
> (myudfsearchfor(\"start\"end\"))
>                          ^
>
>   at scala.sys.package$.error(package.scala:27)
>   at org.apache.spark.sql.catalyst.AbstractSparkSQLParser.parse(AbstractSparkSQLParser.scala:36)
>   at org.apache.spark.sql.catalyst.DefaultParserDialect.parse(ParserDialect.scala:67)
>   at org.apache.spark.sql.SQLContext$$anonfun$2.apply(SQLContext.scala:211)
>   at org.apache.spark.sql.SQLContext$$anonfun$2.apply(SQLContext.scala:211)
>
> Thank you
>


Re: Number Of Partitions in RDD

2017-06-01 Thread Michael Mior
While I'm not sure why you're seeing an increase in partitions with such a
small data file, it's worth noting that the second parameter to textFile is
the *minimum* number of partitions, so there's no guarantee you'll get
exactly that number.
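
If you need an exact partition count, a minimal sketch is to repartition after
reading rather than relying on the textFile hint:

val people = sc.textFile("file:///home/pvikash/data/test.txt")
val exactlyFour = people.repartition(4)  // full shuffle, but guarantees exactly 4 partitions
exactlyFour.getNumPartitions             // Int = 4
val two = exactlyFour.coalesce(2)        // reduce the count without a full shuffle
two.getNumPartitions                     // Int = 2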

--
Michael Mior
mm...@apache.org

2017-06-01 6:28 GMT-04:00 Vikash Pareek :

> Hi,
>
> I am creating an RDD from a text file by specifying the number of partitions,
> but it gives me a different number of partitions than the one I specified.
>
> scala> val people = sc.textFile("file:///home/pvikash/data/test.txt", 0)
> people: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[72] at textFile at <console>:27
>
> scala> people.getNumPartitions
> res47: Int = 1
>
> scala> val people = sc.textFile("file:///home/pvikash/data/test.txt", 1)
> people: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[50] at textFile at <console>:27
>
> scala> people.getNumPartitions
> res36: Int = 1
>
> scala> val people = sc.textFile("file:///home/pvikash/data/test.txt", 2)
> people: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[52] at textFile at <console>:27
>
> scala> people.getNumPartitions
> res37: Int = 2
>
> scala> val people = sc.textFile("file:///home/pvikash/data/test.txt", 3)
> people: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[54] at textFile at <console>:27
>
> scala> people.getNumPartitions
> res38: Int = 3
>
> scala> val people = sc.textFile("file:///home/pvikash/data/test.txt", 4)
> people: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[56] at textFile at <console>:27
>
> scala> people.getNumPartitions
> res39: Int = 4
>
> scala> val people = sc.textFile("file:///home/pvikash/data/test.txt", 5)
> people: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[58] at textFile at <console>:27
>
> scala> people.getNumPartitions
> res40: Int = 6
>
> scala> val people = sc.textFile("file:///home/pvikash/data/test.txt", 6)
> people: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[60] at textFile at <console>:27
>
> scala> people.getNumPartitions
> res41: Int = 7
>
> scala> val people = sc.textFile("file:///home/pvikash/data/test.txt", 7)
> people: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[62] at textFile at <console>:27
>
> scala> people.getNumPartitions
> res42: Int = 8
>
> scala> val people = sc.textFile("file:///home/pvikash/data/test.txt", 8)
> people: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[64] at textFile at <console>:27
>
> scala> people.getNumPartitions
> res43: Int = 9
>
> scala> val people = sc.textFile("file:///home/pvikash/data/test.txt", 9)
> people: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[66] at textFile at <console>:27
>
> scala> people.getNumPartitions
> res44: Int = 11
>
> scala> val people = sc.textFile("file:///home/pvikash/data/test.txt", 10)
> people: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[68] at textFile at <console>:27
>
> scala> people.getNumPartitions
> res45: Int = 11
>
> scala> val people = sc.textFile("file:///home/pvikash/data/test.txt", 11)
> people: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[70] at textFile at <console>:27
>
> scala> people.getNumPartitions
> res46: Int = 13
>
> The contents of the file /home/pvikash/data/test.txt are:
> "
> This is a test file.
> Will be used for rdd partition
> "
>
> I am trying to understand why the number of partitions changes here and, in
> the case where we have small data (which can fit into one partition), why
> Spark creates empty partitions.
>
> Any explanation would be appreciated.
>
> --Vikash
>
>
>
> --
> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Number-Of-Partitions-in-RDD-tp28730.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>