Re: Table created with saveAsTable behaves differently than a table created with spark.sql("CREATE TABLE....)

2023-01-21 Thread Peyman Mohajerian
In the case of saveAsTable("tablename") you specified the partition column: partitionBy("partitionCol"). On Sat, Jan 21, 2023 at 4:03 AM krexos wrote: > My periodically running process writes data to a table over parquet files > with the configuration "spark.sql.sources.partitionOverwriteMode" = >
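
A minimal PySpark sketch of the point above, assuming dynamic partition overwrite; table, column, and data values are placeholders. With partitionBy the saved table is partitioned, so overwrite only replaces the partitions present in the incoming DataFrame; a CREATE TABLE without PARTITIONED BY would not behave the same way.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "partitionCol"])

# Declares partitionCol as a partition column, like PARTITIONED BY in SQL.
df.write.mode("overwrite").partitionBy("partitionCol") \
    .format("parquet").saveAsTable("tablename")

# The SQL path only matches if the partitioning is declared explicitly:
spark.sql("""
    CREATE TABLE IF NOT EXISTS tablename2 (id BIGINT, partitionCol STRING)
    USING parquet
    PARTITIONED BY (partitionCol)
""")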

Re: How the data is distributed

2022-06-06 Thread Peyman Mohajerian
Later. On Mon, Jun 6, 2022 at 2:07 PM Sid wrote: > Hi experts, > > > When we load any file, I know that based on the information in the spark > session about the executors location, status and etc , the data is > distributed among the worker nodes and executors. > > But I have one doubt. Is the

Re: Consuming from Kafka to delta table - stream or batch mode?

2022-02-24 Thread Peyman Mohajerian
If you want to batch consume from Kafka, trigger-once config would work with structured streaming and you get the benefit of the checkpointing. On Thu, Feb 24, 2022 at 6:07 AM Michael Williams (SSI) < michael.willi...@ssigroup.com> wrote: > Hello, > > > > Our team is working with Spark (for the
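
A hedged sketch of the trigger-once pattern; broker, topic, and paths are placeholders, and the delta sink assumes the Delta Lake package is on the classpath.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

kafka_df = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "events")
    .load())

(kafka_df.writeStream
    .format("delta")
    .option("checkpointLocation", "/tmp/checkpoints/events")  # offsets tracked here
    .trigger(once=True)  # drain what's available, then stop: batch-like semantics
    .start("/tmp/delta/events"))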

Re: question for definition of column types

2022-01-26 Thread Peyman Mohajerian
from pyspark.sql.types import * list =[("buck trends", "ceo", 20.00, 0.25, "100")] schema = StructType([ StructField("name", StringType(), True), StructField("title", StringType(), True), StructField("salary", DoubleType(), True),
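
The snippet above is cut off by the archive; a hedged completion, where the last two field names ("commission" and "employeeId") are guesses chosen only to make it runnable:

from pyspark.sql import SparkSession
from pyspark.sql.types import (StructType, StructField,
                               StringType, DoubleType)

spark = SparkSession.builder.getOrCreate()

data = [("buck trends", "ceo", 20.00, 0.25, "100")]
schema = StructType([
    StructField("name", StringType(), True),
    StructField("title", StringType(), True),
    StructField("salary", DoubleType(), True),
    StructField("commission", DoubleType(), True),  # assumed field name
    StructField("employeeId", StringType(), True),  # assumed field name
])
spark.createDataFrame(data, schema).show()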

Re: DropNa in Spark for Columns

2021-02-27 Thread Peyman Mohajerian
I don't have personal experience with Koalas but it does seem to have the same api: https://koalas.readthedocs.io/en/latest/reference/api/databricks.koalas.DataFrame.dropna.html On Fri, Feb 26, 2021 at 11:46 PM Vitali Lupusor wrote: > Hello Chetan, > > I don’t know about Scala, but in PySpark

Re: Question on bucketing vs sorting

2020-12-31 Thread Peyman Mohajerian
ing this for me. > Would you say there's a case for using bucketing in this case at all, or > should I simply focus completely on the sorting solution? If so, when would > you say bucketing is the preferred solution? > > Patrik Iselind > > > On Thu, Dec 31, 2020 at 4:15 PM

Re: Question on bucketing vs sorting

2020-12-31 Thread Peyman Mohajerian
You can save your data to hdfs or other targets using either a sorted or bucketed dataframe. In the case of bucketing you will have a different data skipping mechanism when you read back the data compared to the sorted version. On Thu, Dec 31, 2020 at 5:40 AM Patrik Iselind wrote: > Hi
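
A minimal sketch of the two write paths, with made-up table and column names; bucketing is only available through saveAsTable, while sorting just orders rows within each output file:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "u1"), (2, "u2")], ["id", "userId"])

# Bucketed: bucket metadata lets later reads skip shuffles on userId.
df.write.format("parquet").bucketBy(8, "userId").sortBy("userId") \
    .saveAsTable("events_bucketed")

# Sorted: plain parquet files, ordered within partitions, which helps
# min/max-based data skipping when reading back.
df.sortWithinPartitions("userId").write.mode("overwrite") \
    .parquet("/tmp/events_sorted")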

Re: Using UDF based on Numpy functions in Spark SQL

2020-12-23 Thread Peyman Mohajerian
https://stackoverflow.com/questions/43484269/how-to-register-udf-to-use-in-sql-and-dataframe On Wed, Dec 23, 2020 at 12:52 PM Mich Talebzadeh wrote: > Hi, > > > This is a shot in the dark so to speak. > > > I would like to use the standard deviation std offered by numpy in > PySpark. I am using
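
A hedged sketch of what the linked answer describes: wrapping numpy.std in a registered UDF so it can be called from SQL (names are illustrative; for exact results Spark's built-in stddev functions are preferable):

import numpy as np
from pyspark.sql import SparkSession
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.getOrCreate()

# The UDF receives an array column (here built with collect_list).
spark.udf.register("np_std", lambda xs: float(np.std(xs)), DoubleType())

spark.range(10).createOrReplaceTempView("t")
spark.sql("SELECT np_std(collect_list(id)) AS sd FROM t").show()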

Re: [Spark SQL]: Stability of large many-to-many joins

2020-03-20 Thread Peyman Mohajerian
Two options: either add salting to your join, or filter out the records with frequent keys, join them separately, and then union the results back. This is the skew-join issue. On Fri, Mar 20, 2020 at 4:12 AM nathan grand wrote: > Hi, > > I have two very large datasets, which both have many repeated keys, which I > wish
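
A minimal sketch of the salting option, assuming the smaller side can be replicated; all names and the salt factor are illustrative:

import pyspark.sql.functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
left = spark.createDataFrame([(1, "a"), (1, "b")], ["key", "lval"])
right = spark.createDataFrame([(1, "x")], ["key", "rval"])

N = 16
# Scatter each hot key on the big side across N sub-keys...
left_salted = left.withColumn("salt", (F.rand() * N).cast("int"))
# ...and replicate the other side N times so every sub-key finds a match.
right_salted = right.crossJoin(
    spark.range(N).withColumnRenamed("id", "salt"))

joined = left_salted.join(right_salted, ["key", "salt"]).drop("salt")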

Re: Time-Series Forecasting

2018-09-29 Thread Peyman Mohajerian
Here's a blog on Flint: https://databricks.com/blog/2018/09/11/introducing-flint-a-time-series-library-for-apache-spark.html I don't have an opinion about it, just that Flint was mentioned earlier. On Thu, Sep 20, 2018 at 2:12 AM, Gourav Sengupta wrote: > Hi, > > If you are following the time

Re: pyspark vector

2017-04-24 Thread Peyman Mohajerian
setVocabSize On Mon, Apr 24, 2017 at 5:36 PM, Zeming Yu wrote: > Hi all, > > Beginner question: > > what does the 3 mean in the (3,[0,1,2],[1.0,1.0,1.0])? > > https://spark.apache.org/docs/2.1.0/ml-features.html > > id | texts | vector >
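
To spell out the answer: in (3,[0,1,2],[1.0,1.0,1.0]) the leading 3 is the sparse vector's size; for CountVectorizer output that is the vocabulary size, which setVocabSize caps. A short sketch:

from pyspark.ml.linalg import Vectors

v = Vectors.sparse(3, [0, 1, 2], [1.0, 1.0, 1.0])
print(v.size)     # 3: total vector length (= vocabulary size here)
print(v.indices)  # positions of the non-zero entries
print(v.values)   # the non-zero values themselves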

Re: Ingesting data in parallel across workers in Data Frame

2017-01-20 Thread Peyman Mohajerian
The next section in the same document has a solution. On Fri, Jan 20, 2017 at 9:03 PM, Abhishek Gupta wrote: > I am trying to load data from the database into DataFrame using JDBC > driver.I want to get data into partitions the following document has the > nice
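
For reference, a hedged sketch of the partitioned JDBC read that section describes; the URL, table, bounds, and credentials are placeholders:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = (spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://host:5432/db")
    .option("dbtable", "schema.table")
    .option("partitionColumn", "id")  # must be numeric, date, or timestamp
    .option("lowerBound", "1")
    .option("upperBound", "1000000")
    .option("numPartitions", "16")    # 16 parallel range queries
    .option("user", "user")
    .option("password", "password")
    .load())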

Re: Spark SQL DataFrame to Kafka Topic

2017-01-13 Thread Peyman Mohajerian
Yes, it is called Structured Streaming: https://docs.databricks.com/_static/notebooks/structured-streaming-kafka.html http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html On Fri, Jan 13, 2017 at 3:32 AM, Senthil Kumar wrote: > Hi Team , > >
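
A hedged sketch of the Kafka sink from the linked guide; the rate source stands in for a real stream, and broker, topic, and paths are placeholders:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
stream_df = spark.readStream.format("rate").load()  # toy source: timestamp, value

query = (stream_df
    .selectExpr("CAST(value AS STRING) AS value")  # Kafka sink expects a value column
    .writeStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("topic", "mytopic")
    .option("checkpointLocation", "/tmp/ckpt/mytopic")
    .start())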

Re: [ML] Converting ml.DenseVector to mllib.Vector

2016-12-31 Thread Peyman Mohajerian
This may also help: http://spark.apache.org/docs/latest/ml-migration-guides.html On Sat, Dec 31, 2016 at 6:51 AM, Marco Mistroni wrote: > Hi. > you have a DataFrame.. there should be either a way to > - convert a DF to a Vector without doing a cast > - use a ML library
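
A short sketch of the conversion itself, assuming Spark 2.0+ where mllib gained a fromML helper; the toArray fallback should work on any version:

from pyspark.ml.linalg import Vectors as MLVectors
from pyspark.mllib.linalg import Vectors as MLlibVectors

ml_v = MLVectors.dense([1.0, 2.0, 3.0])
mllib_v = MLlibVectors.fromML(ml_v)            # ml -> mllib, no cast needed
fallback = MLlibVectors.dense(ml_v.toArray())  # equivalent, version-agnostic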

Re: how to find NaN values of each row of spark dataframe to decide whether the row is dropped or not

2016-09-26 Thread Peyman Mohajerian
Also take a look at this API: https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrameNaFunctions On Mon, Sep 26, 2016 at 1:09 AM, Bedrytski Aliaksandr wrote: > Hi Muhammet, > > python also supports sql queries
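
A minimal sketch of that API in PySpark; the data and threshold are illustrative:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1.0, 2.0), (None, 3.0), (None, None)], ["a", "b"])

df.na.drop(how="any").show()  # drop rows containing any null/NaN
df.na.drop(thresh=1).show()   # keep rows with at least one non-null value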

Re: how to decide which part of process use spark dataframe and pandas dataframe?

2016-09-26 Thread Peyman Mohajerian
A simple way to do that is to collect data in the driver when you need to use Python pandas. On Monday, September 26, 2016, muhammet pakyürek wrote: > > > is there a clear guide to decide the above? >
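
A one-line sketch of that pattern; note toPandas() pulls the whole result into driver memory, so it is only safe for small outputs:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
pdf = spark.range(1000).limit(100).toPandas()  # now a regular pandas DataFrame
print(type(pdf))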

Re: Spark Streaming-- for each new file in HDFS

2016-09-15 Thread Peyman Mohajerian
You can listen for new files in a specific directory using streamingContext.fileStream. Take a look at: http://spark.apache.org/docs/latest/streaming-programming-guide.html On Thu, Sep 15, 2016 at 10:31 AM, Jörn Franke wrote: > Hi, > I recommend that the third party
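
A hedged sketch of the directory-watching source; in Python the closest equivalent is textFileStream (fileStream itself is Scala/Java only), and the path and batch interval are placeholders:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="dir-watch")
ssc = StreamingContext(sc, 30)  # 30-second micro-batches

lines = ssc.textFileStream("hdfs:///incoming/dir")  # only picks up newly added files
lines.pprint()

ssc.start()
ssc.awaitTermination()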

Re: Scala Vs Python

2016-09-01 Thread Peyman Mohajerian
https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html On Thu, Sep 1, 2016 at 3:01 PM, Mich Talebzadeh wrote: > Hi Jacob. > > My understanding of Dataset is that it is basically an RDD with some > optimization gone

Re: AnalysisException exception while parsing XML

2016-08-31 Thread Peyman Mohajerian
rk, i may have some part of this wrong, didn't test it, but something similar. On Wed, Aug 31, 2016 at 5:54 PM, <srikanth.je...@gmail.com> wrote: > How do we explode nested arrays? > > > > Thanks, > Sreekanth Jella > > > > *From: *Peyman Mohajerian <mohaj...@gma

Re: AnalysisException exception while parsing XML

2016-08-31 Thread Peyman Mohajerian
Once you get to the 'Array' type, you have to use explode; you cannot do the same dotted traversal. On Wed, Aug 31, 2016 at 2:19 PM, wrote: > Hello Experts, > > > > I am using Spark XML package to parse the XML. Below exception is being > thrown when trying to *parse a tag
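
A minimal sketch of the pattern, with an invented nested schema: struct fields can be traversed with dots, but an array element has to be exploded into rows first:

import pyspark.sql.functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1, [("a", 1.0), ("b", 2.0)])],
    "id INT, items ARRAY<STRUCT<sku: STRING, price: DOUBLE>>")

flat = (df.select("id", F.explode("items").alias("item"))  # array -> one row per element
          .select("id", "item.sku", "item.price"))         # structs traverse with dots
flat.show()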

Re: update specifc rows to DB using sqlContext

2016-08-11 Thread Peyman Mohajerian
Alternatively, you should be able to write to a new table and use a trigger or some other mechanism to update the particular row. I don't have any experience with this myself but just looking at this documentation:

Re: Machine learning question (using spark) - removing redundant factors while doing clustering

2016-08-08 Thread Peyman Mohajerian
You can try 'feature importances' or 'feature selection', depending on what else you want to do with the remaining features. Let's say you are trying to do classification: some of the Spark libraries have a model parameter called 'featureImportances' that tells you which
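
A hedged sketch using a random forest, one of the models that exposes this; the tiny training set is only there to make it runnable:

from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.linalg import Vectors
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
train = spark.createDataFrame(
    [(Vectors.dense([0.0, 1.0]), 0.0),
     (Vectors.dense([1.0, 0.0]), 1.0)],
    ["features", "label"])

model = RandomForestClassifier(labelCol="label").fit(train)
print(model.featureImportances)  # one weight per feature index; near-zero entries are candidates to drop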

Re: ML PipelineModel to be scored locally

2016-07-20 Thread Peyman Mohajerian
One option is to save the model in parquet or json format and then build your own prediction code. Some also use: https://github.com/jpmml/jpmml-sparkml Whether this works or not depends on the model (e.g. ml vs. mllib) and other factors. A couple of weeks ago there was a long discussion on

Re: Is it possible to feed web requests via a spark application directly?

2016-06-15 Thread Peyman Mohajerian
There are a variety of REST API services you can use, but you must consider carefully whether it makes sense to start a Spark job per individual request. If instead you mean starting a Spark job based on some triggering event, then using a RESTful service makes sense.

Re: Apache Flink

2016-04-17 Thread Peyman Mohajerian
Microbatching is certainly not a waste of time; you are making way too strong a statement. In certain cases processing one tuple at a time makes no sense; it all depends on the use case. In fact, if you understand the history of the Storm project, you would know that microbatching was added

Re: Spark replacing Hadoop

2016-04-14 Thread Peyman Mohajerian
Cloud adds another dimension: the fact that compute and storage are decoupled in the cloud (S3 with EMR, or Blob storage with HDInsight) means that in the cloud Hadoop ends up being more of a compute engine, and a lot of the governance and security features are irrelevant or less important because data at rest is outside Hadoop.

Re: Sqoop on Spark

2016-04-06 Thread Peyman Mohajerian
na as well. > > Dr Mich Talebzadeh > > LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw > > http://talebzadehmich

Re: Sqoop on Spark

2016-04-06 Thread Peyman Mohajerian
For some MPP relational stores (not operational) it may be feasible to run Spark jobs and also have data locality. I know QueryGrid (Teradata) and PolyBase (Microsoft) use data locality to move data between their MPP systems and Hadoop. I would guess (I have no idea) someone like IBM is already doing that

Re: Repeating Records w/ Spark + Avro?

2016-03-11 Thread Peyman Mohajerian
Here is the reason for the behavior: '''Note:''' Because Hadoop's RecordReader class re-uses the same Writable object for each record, directly caching the returned RDD or directly passing it to an aggregation or shuffle operation will create many references to the same object. If you plan to

Re: Is Spark right for us?

2016-03-06 Thread Peyman Mohajerian
If your relational database has enough computing power, you don't have to change it. You can just run SQL queries on top of it, or even run Spark queries over it. There is no hard-and-fast rule about using big data tools. Usually people or organizations don't jump into big data for one specific use

Re: Aster Functions equivalent in spark : cfilter, npath and sessionize

2015-10-29 Thread Peyman Mohajerian
Some of the Aster functions you are referring to can be done using Window functions in SparkSQL: https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html On Thu, Oct 29, 2015 at 12:16 PM, didier vila wrote: > Good morning all, > > I am
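
For example, a hedged sessionize sketch using the window functions from that post; column names and the 30-minute timeout are illustrative:

import pyspark.sql.functions as F
from pyspark.sql import SparkSession
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("u1", 0), ("u1", 600), ("u1", 5000)], ["userId", "ts"])  # ts in epoch seconds

w = Window.partitionBy("userId").orderBy("ts")
gap = F.col("ts") - F.lag("ts").over(w)
sessions = (df
    .withColumn("new_session", F.when(gap.isNull() | (gap > 1800), 1).otherwise(0))
    .withColumn("session_id", F.sum("new_session").over(w)))  # running count of session starts
sessions.show()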

Re: Is there any Spark SQL reference manual?

2015-09-11 Thread Peyman Mohajerian
http://docs.datastax.com/en/datastax_enterprise/4.6/datastax_enterprise/spark/sparkSqlSupportedSyntax.html On Fri, Sep 11, 2015 at 8:15 AM, Richard Hillegas wrote: > The latest Derby SQL Reference manual (version 10.11) can be found here: >

Re: Twitter streaming with apache spark stream only a small amount of tweets

2015-07-29 Thread Peyman Mohajerian
This question was answered with sample code a couple of days ago; please look back. On Sat, Jul 25, 2015 at 11:43 PM, Zoran Jeremic zoran.jere...@gmail.com wrote: Hi, I discovered what is the problem here. Twitter public stream is limited to 1% of overall tweets (https://goo.gl/kDwnyS), so

Re: Twitter streaming with apache spark stream only a small amount of tweets

2015-07-29 Thread Peyman Mohajerian
follow. Thanks, Zoran On Wed, Jul 29, 2015 at 8:40 AM, Peyman Mohajerian mohaj...@gmail.com wrote: This question was answered with sample code a couple of days ago, please look back. On Sat, Jul 25, 2015 at 11:43 PM, Zoran Jeremic zoran.jere...@gmail.com wrote: Hi, I discovered what