Re: help in copying data from one azure subscription to another azure subscription

2018-05-23 Thread Pushkar.Gujar
What are you using for storing data in those subscriptions? Data Lake or
Blobs? Azure Data Factory is already available and can copy between these
cloud storage services without having to go through Spark.
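
If the copy does have to go through Spark anyway, a minimal PySpark sketch is
below - account names, container names, keys and paths are placeholders, and
it assumes the hadoop-azure/WASB connector jars are on the classpath:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cross-subscription-copy").getOrCreate()

# Storage account names and keys are placeholders - substitute your own.
spark.conf.set("fs.azure.account.key.sourceacct.blob.core.windows.net", "<source-account-key>")
spark.conf.set("fs.azure.account.key.destacct.blob.core.windows.net", "<dest-account-key>")

# Read from the source container and write straight to the destination container.
df = spark.read.parquet("wasbs://data@sourceacct.blob.core.windows.net/input/")
df.write.mode("overwrite").parquet("wasbs://data@destacct.blob.core.windows.net/output/")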


Thank you,
*Pushkar Gujar*


On Mon, May 21, 2018 at 8:59 AM, amit kumar singh 
wrote:

> HI Team,
>
> We are trying to move data from one Azure subscription to another Azure
> subscription. Is there a faster way to do this through Spark?
>
> I am using distcp and it's taking forever.
>
> thanks
> rohit
>


Re: Multiple CSV libs causes issues spark 2.1

2017-05-09 Thread Pushkar.Gujar
>
> df = spark.sqlContext.read.csv('out/df_in.csv')
>

Shouldn't this be just -

df = spark.read.csv('out/df_in.csv')

SparkSession itself is the entry point to DataFrame and SQL functionality.
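
If both CSV sources have to stay on the classpath, the workaround the error
later in this thread asks for - naming the source by its fully qualified
class - looks roughly like this sketch:

# Name the data source by its fully qualified class while both the Databricks
# spark-csv package and Spark 2.x's built-in CSV source are on the classpath.
df = (spark.read
      .format("org.apache.spark.sql.execution.datasources.csv.CSVFileFormat")
      .load("out/df_in.csv"))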


Thank you,
*Pushkar Gujar*


On Tue, May 9, 2017 at 6:09 PM, Mark Hamstra 
wrote:

> Looks to me like it is a conflict between a Databricks library and Spark
> 2.1. That's an issue for Databricks to resolve or provide guidance.
>
> On Tue, May 9, 2017 at 2:36 PM, lucas.g...@gmail.com  > wrote:
>
>> I'm a bit confused by that answer; I'm assuming it's Spark deciding which
>> lib to use.
>>
>> On 9 May 2017 at 14:30, Mark Hamstra  wrote:
>>
>>> This looks more like a matter for Databricks support than spark-user.
>>>
>>> On Tue, May 9, 2017 at 2:02 PM, lucas.g...@gmail.com <
>>> lucas.g...@gmail.com> wrote:
>>>
 df = spark.sqlContext.read.csv('out/df_in.csv')
>


> 17/05/09 15:51:29 WARN ObjectStore: Version information not found in
> metastore. hive.metastore.schema.verification is not enabled so
> recording the schema version 1.2.0
> 17/05/09 15:51:29 WARN ObjectStore: Failed to get database default,
> returning NoSuchObjectException
> 17/05/09 15:51:30 WARN ObjectStore: Failed to get database
> global_temp, returning NoSuchObjectException
>


> Py4JJavaError: An error occurred while calling o72.csv.
> : java.lang.RuntimeException: Multiple sources found for csv
> (*com.databricks.spark.csv.DefaultSource15,
> org.apache.spark.sql.execution.datasources.csv.CSVFileFormat*),
> please specify the fully qualified class name.
> at scala.sys.package$.error(package.scala:27)
> at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:591)
> at org.apache.spark.sql.execution.datasources.DataSource.providingClass$lzycompute(DataSource.scala:86)
> at org.apache.spark.sql.execution.datasources.DataSource.providingClass(DataSource.scala:86)
> at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:325)
> at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
> at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:415)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
> at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
> at py4j.Gateway.invoke(Gateway.java:280)
> at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
> at py4j.commands.CallCommand.execute(CallCommand.java:79)
> at py4j.GatewayConnection.run(GatewayConnection.java:214)
> at java.lang.Thread.run(Thread.java:745)


 When I change our call to:

 df = spark.hiveContext.read \
     .format('org.apache.spark.sql.execution.datasources.csv.CSVFileFormat') \
     .load('df_in.csv')

 No such issue. I was under the impression (obviously wrongly) that
 Spark would automatically pick the local lib. We have the Databricks
 library because other jobs still explicitly call it.

 Is the 'correct answer' to go through and modify so as to remove the
 databricks lib / remove it from our deploy?  Or should this just work?

 One of the things I find less helpful in the Spark docs is when
 there are multiple ways to do something but no clear guidance on what those
 methods are intended to accomplish.

 Thanks!

>>>
>>>
>>
>


Re: Spark books

2017-05-03 Thread Pushkar.Gujar
*"I would suggest do not buy any book, just start with databricks community
edition"*

I don't agree with the above; the "Learning Spark" book was definitely a
stepping stone for me. All the basics a beginner will need are covered in a
very easy-to-understand format with examples. Great book, highly
recommended.

Of course, one has to keep advancing beyond it; the Apache documentation
along with GitHub repos are excellent resources.


Thank you,
*Pushkar Gujar*


On Wed, May 3, 2017 at 8:16 PM, Neelesh Salian 
wrote:

> The Apache Spark documentation is good to begin with.
> All the programming guides, particularly.
>
>
> On Wed, May 3, 2017 at 5:07 PM, ayan guha  wrote:
>
>> I would suggest do not buy any book, just start with databricks community
>> edition
>>
>> On Thu, May 4, 2017 at 9:30 AM, Tobi Bosede  wrote:
>>
>>> Well that is the nature of technology, ever evolving. There will always
>>> be new concepts. If you're trying to get started ASAP and the internet
>>> isn't enough, I'd recommend buying a book and using Spark 1.6. A lot of
>>> production stacks are still on that version and the knowledge from
>>> mastering 1.6 is transferable to 2+. I think that beats waiting forever.
>>>
>>> On Wed, May 3, 2017 at 6:35 PM, Zeming Yu  wrote:
>>>
 I'm trying to decide whether to buy the book Learning Spark, Spark for
 Machine Learning, etc., or wait for a new edition covering the new concepts
 like DataFrames and Datasets. Anyone got any suggestions?

>>>
>>>
>>
>>
>> --
>> Best Regards,
>> Ayan Guha
>>
>
>
>
> --
> Regards,
> Neelesh S. Salian
>
>


Re: how to find the nearest holiday

2017-04-25 Thread Pushkar.Gujar
You can use -

start_date_test2.holiday.getItem(0)

I would highly suggest you look at the latest documentation -
http://spark.apache.org/docs/latest/api/python/index.html
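
For the follow-up question about days to the nearest holiday, a rough sketch
(assuming the holiday column is an array of 'yyyy-MM-dd' strings, as shown
further down the thread) is to explode the array and take the minimum
absolute difference:

from pyspark.sql import functions as F

# Explode the holiday array, compute the absolute day difference for each
# holiday, and keep the minimum per start_date.
nearest = (start_date_test2
           .withColumn("holiday_date", F.explode("holiday"))
           .withColumn("diff", F.abs(F.datediff("start_date", "holiday_date")))
           .groupBy("start_date")
           .agg(F.min("diff").alias("days_from_nearest_holiday")))
nearest.show()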


Thank you,
*Pushkar Gujar*


On Tue, Apr 25, 2017 at 8:50 AM, Zeming Yu  wrote:

> How could I access the first element of the holiday column?
>
> I tried the following code, but it doesn't work:
> start_date_test2.withColumn("diff", datediff(start_date_test2.start_date,
> start_date_test2.holiday[0])).show()
>
> On Tue, Apr 25, 2017 at 10:20 PM, Zeming Yu  wrote:
>
>> Got it working now!
>>
>> Does anyone have a pyspark example of how to calculate the number of
>> days from the nearest holiday based on an array column?
>>
>> I.e. from this table
>>
>> +----------+------------------------+
>> |start_date|holiday                 |
>> +----------+------------------------+
>> |2017-08-11|[2017-05-30, 2017-10-01]|
>> +----------+------------------------+
>>
>> calculate a column called "days_from_nearest_holiday" holding the
>> difference between 11 Aug 2017 and 1 Oct 2017?
>>
>>
>>
>>
>>
>> On Tue, Apr 25, 2017 at 6:00 PM, Wen Pei Yu  wrote:
>>
>>> TypeError: unorderable types: str() >= datetime.date()
>>>
>>> You should convert the string to a Date type before comparing.
>>>
>>> Yu Wenpei.
>>>
>>>
>>> - Original message -
>>> From: Zeming Yu 
>>> To: user 
>>> Cc:
>>> Subject: how to find the nearest holiday
>>> Date: Tue, Apr 25, 2017 3:39 PM
>>>
>>> I have a column of dates (date type) and am just trying to find the nearest
>>> holiday for each date. Does anyone have any idea what went wrong below?
>>>
>>>
>>>
>>> start_date_test = flight3.select("start_date").distinct()
>>> start_date_test.show()
>>>
>>> holidays = ['2017-09-01', '2017-10-01']
>>>
>>> +----------+
>>> |start_date|
>>> +----------+
>>> |2017-08-11|
>>> |2017-09-11|
>>> |2017-09-28|
>>> |2017-06-29|
>>> |2017-09-29|
>>> |2017-07-31|
>>> |2017-08-14|
>>> |2017-08-18|
>>> |2017-04-09|
>>> |2017-09-21|
>>> |2017-08-10|
>>> |2017-06-30|
>>> |2017-08-19|
>>> |2017-07-06|
>>> |2017-06-28|
>>> |2017-09-14|
>>> |2017-08-08|
>>> |2017-08-22|
>>> |2017-07-03|
>>> |2017-07-30|
>>> +----------+
>>> only showing top 20 rows
>>>
>>>
>>>
>>> index = spark.sparkContext.broadcast(sorted(holidays))
>>>
>>> def nearest_holiday(date):
>>>     last_holiday = index.value[0]
>>>     for next_holiday in index.value:
>>>         if next_holiday >= date:
>>>             break
>>>         last_holiday = next_holiday
>>>     if last_holiday > date:
>>>         last_holiday = None
>>>     if next_holiday < date:
>>>         next_holiday = None
>>>     return (last_holiday, next_holiday)
>>>
>>>
>>> from pyspark.sql.types import *
>>> return_type = StructType([StructField('last_holiday', StringType()),
>>> StructField('next_holiday', StringType())])
>>>
>>> from pyspark.sql.functions import udf
>>> nearest_holiday_udf = udf(nearest_holiday, return_type)
>>>
>>> start_date_test.withColumn('holiday',
>>>     nearest_holiday_udf('start_date')).show(5, False)
>>>
>>>
>>> here's the error I got:
>>>
>>> ---------------------------------------------------------------------------
>>> Py4JJavaError                             Traceback (most recent call last)
>>> <ipython-input-...> in <module>()
>>>      24 nearest_holiday_udf = udf(nearest_holiday, return_type)
>>>      25
>>> ---> 26 start_date_test.withColumn('holiday', nearest_holiday_udf('start_date')).show(5, False)
>>>
>>> C:\spark-2.1.0-bin-hadoop2.7\spark-2.1.0-bin-hadoop2.7\python\pyspark\sql\dataframe.py in show(self, n, truncate)
>>>     318             print(self._jdf.showString(n, 20))
>>>     319         else:
>>> --> 320             print(self._jdf.showString(n, int(truncate)))
>>>     321
>>>     322     def __repr__(self):
>>>
>>> C:\spark-2.1.0-bin-hadoop2.7\spark-2.1.0-bin-hadoop2.7\python\lib\py4j-0.10.4-src.zip\py4j\java_gateway.py in __call__(self, *args)
>>>    1131         answer = self.gateway_client.send_command(command)
>>>    1132         return_value = get_return_value(
>>> -> 1133             answer, self.gateway_client, self.target_id, self.name)
>>>    1134
>>>    1135         for temp_arg in temp_args:
>>>
>>> C:\spark-2.1.0-bin-hadoop2.7\spark-2.1.0-bin-hadoop2.7\python\pyspark\sql\utils.py in deco(*a, **kw)
>>>      61     def deco(*a, **kw):
>>>      62         try:
>>> ---> 63             return f(*a, **kw)
>>>      64         except py4j.protocol.Py4JJavaError as e:
>>>      65             s = e.java_exception.toString()
>>>
>>> C:\spark-2.1.0-bin-hadoop2.7\spark-2.1.0-bin-hadoop2.7\python\lib\py4j-0.10.4-src.zip\py4j\protocol.py in get_return_value(answer, gateway_client, target_id, name)
>>>     317                 raise Py4JJavaError(
>>>     318                     "An error occurred while calling {0}{1}{2}.\n".
>>> --> 319                     format(target_id, ".", name), value)
>>>     320             else:
>>>     321

Re: udf that handles null values

2017-04-24 Thread Pushkar.Gujar
Someone had a similar issue today on Stack Overflow:

http://stackoverflow.com/questions/43595201/python-how-to-convert-pyspark-column-to-date-type-if-there-are-null-values/43595728#43595728
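
Along the same lines, here is a minimal null-safe sketch of the UDF from the
question below (dataframe and column names are taken from that message; this
is not necessarily the exact fix given in the linked answer):

from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

# Bail out early whenever either input is null, otherwise compute the minutes.
def get_minutes(h_string, min_string):
    if h_string is None or min_string is None:
        return None
    return int(h_string) * 60 + int(min_string[:-1])

udf_get_minutes = udf(get_minutes, IntegerType())

flight2 = flight2.withColumn(
    "duration_minutes", udf_get_minutes("duration_h", "duration_m"))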



Thank you,
*Pushkar Gujar*


On Mon, Apr 24, 2017 at 8:22 PM, Zeming Yu  wrote:

> hi all,
>
> I tried to write a UDF that handles null values:
>
> def getMinutes(hString, minString):
>     if (hString != None) & (minString != None):
>         return int(hString) * 60 + int(minString[:-1])
>     else:
>         return None
>
> flight2 = (flight2.withColumn("duration_minutes",
> udfGetMinutes("duration_h", "duration_m")))
>
>
> but I got this error:
>
>   File "", line 6, in getMinutes
> TypeError: int() argument must be a string, a bytes-like object or a number, 
> not 'NoneType'
>
>
> Does anyone know how to do this?
>
>
> Thanks,
>
> Zeming
>
>


Re: question regarding pyspark

2017-04-21 Thread Pushkar.Gujar
Hi Afshin,

If you need to associate the header information from the 2nd file with the
first one, you can do that by specifying a custom schema. Below is an example
from the spark-csv package. As you can guess, you will have to do some
pre-processing to create customSchema by first reading the second file.

val customSchema = StructType(Array(
    StructField("year", IntegerType, true),
    StructField("make", StringType, true),
    StructField("model", StringType, true),
    StructField("comment", StringType, true),
    StructField("blank", StringType, true)))

val df = sqlContext.read
    .format("com.databricks.spark.csv")
    .option("header", "true") // Use first line of all files as header
    .schema(customSchema)
    .load("cars.csv")



Thank you,
*Pushkar Gujar*


On Fri, Apr 21, 2017 at 7:37 PM, Afshin, Bardia <
bardia.afs...@capitalone.com> wrote:

> I'm ingesting a CSV with hundreds of columns, and the original CSV file
> itself doesn't have any header. I do have a separate file that is just the
> headers. Is there a way to tell the Spark API this information when loading the
> CSV file? Or do I have to do some preprocessing before doing so?
>
>
>
> Thanks,
>
> Bardia Afshin
>
>


Re: how to add new column using regular expression within pyspark dataframe

2017-04-20 Thread Pushkar.Gujar
Can be as simple as -

from pyspark.sql.functions import split

flight.withColumn('hour',split(flight.duration,'h').getItem(0))
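
If the hour needs to end up numeric (which the question below asks for), the
same idea with a cast - or regexp_extract - still avoids any RDD round-trip.
A sketch:

from pyspark.sql.functions import regexp_extract, split

# Cast the piece in front of 'h' to an integer column...
flight = flight.withColumn(
    "hour", split(flight.duration, "h").getItem(0).cast("int"))

# ...or pull it out with a regular expression instead.
flight = flight.withColumn(
    "hour", regexp_extract(flight.duration, r"(\d+)h", 1).cast("int"))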


Thank you,
*Pushkar Gujar*


On Thu, Apr 20, 2017 at 4:35 AM, Zeming Yu  wrote:

> Any examples?
>
> On 20 Apr. 2017 3:44 pm, "颜发才(Yan Facai)"  wrote:
>
>> How about using `withColumn` and UDF?
>>
>> example:
>> + https://gist.github.com/zoltanctoth/2deccd69e3d1cde1dd78
>> 
>> + https://ragrawal.wordpress.com/2015/10/02/spark-custom-udf-example/
>>
>>
>>
>> On Mon, Apr 17, 2017 at 8:25 PM, Zeming Yu  wrote:
>>
>>> I've got a dataframe with a column looking like this:
>>>
>>> display(flight.select("duration").show())
>>>
>>> +--------+
>>> |duration|
>>> +--------+
>>> |  15h10m|
>>> |   17h0m|
>>> |  21h25m|
>>> |  14h30m|
>>> |  24h50m|
>>> |  26h10m|
>>> |  14h30m|
>>> |   23h5m|
>>> |  21h30m|
>>> |  11h50m|
>>> |  16h10m|
>>> |  15h15m|
>>> |  21h25m|
>>> |  14h25m|
>>> |  14h40m|
>>> |   16h0m|
>>> |  24h20m|
>>> |  14h30m|
>>> |  14h25m|
>>> |  14h30m|
>>> +--------+
>>> only showing top 20 rows
>>>
>>>
>>>
>>> I need to extract the hour as a number and store it as an additional
>>> column within the same dataframe. What's the best way to do that?
>>>
>>>
>>> I tried the following, but it failed:
>>>
>>> import re
>>> def getHours(x):
>>>   return re.match('([0-9]+(?=h))', x)
>>> temp = flight.select("duration").rdd.map(lambda x:getHours(x[0])).toDF()
>>> temp.select("duration").show()
>>>
>>>
>>> error message:
>>>
>>>
>>> ---------------------------------------------------------------------------
>>> Py4JJavaError                             Traceback (most recent call last)
>>> <ipython-input-...> in <module>()
>>>       2 def getHours(x):
>>>       3   return re.match('([0-9]+(?=h))', x)
>>> ----> 4 temp = flight.select("duration").rdd.map(lambda x:getHours(x[0])).toDF()
>>>       5 temp.select("duration").show()
>>>
>>> C:\spark-2.1.0-bin-hadoop2.7\spark-2.1.0-bin-hadoop2.7\python\pyspark\sql\session.py in toDF(self, schema, sampleRatio)
>>>      55         [Row(name=u'Alice', age=1)]
>>>      56         """
>>> ---> 57         return sparkSession.createDataFrame(self, schema, sampleRatio)
>>>      58
>>>      59     RDD.toDF = toDF
>>>
>>> C:\spark-2.1.0-bin-hadoop2.7\spark-2.1.0-bin-hadoop2.7\python\pyspark\sql\session.py in createDataFrame(self, data, schema, samplingRatio, verifySchema)
>>>     518
>>>     519         if isinstance(data, RDD):
>>> --> 520             rdd, schema = self._createFromRDD(data.map(prepare), schema, samplingRatio)
>>>     521         else:
>>>     522             rdd, schema = self._createFromLocal(map(prepare, data), schema)
>>>
>>> C:\spark-2.1.0-bin-hadoop2.7\spark-2.1.0-bin-hadoop2.7\python\pyspark\sql\session.py in _createFromRDD(self, rdd, schema, samplingRatio)
>>>     358         """
>>>     359         if schema is None or isinstance(schema, (list, tuple)):
>>> --> 360             struct = self._inferSchema(rdd, samplingRatio)
>>>     361             converter = _create_converter(struct)
>>>     362             rdd = rdd.map(converter)
>>>
>>> C:\spark-2.1.0-bin-hadoop2.7\spark-2.1.0-bin-hadoop2.7\python\pyspark\sql\session.py in _inferSchema(self, rdd, samplingRatio)
>>>     329         :return: :class:`pyspark.sql.types.StructType`
>>>     330         """
>>> --> 331         first = rdd.first()
>>>     332         if not first:
>>>     333             raise ValueError("The first row in RDD is empty, "
>>>
>>> C:\spark-2.1.0-bin-hadoop2.7\spark-2.1.0-bin-hadoop2.7\python\pyspark\rdd.py in first(self)
>>>    1359             ValueError: RDD is empty
>>>    1360         """
>>> -> 1361         rs = self.take(1)
>>>    1362         if rs:
>>>    1363             return rs[0]
>>>
>>> C:\spark-2.1.0-bin-hadoop2.7\spark-2.1.0-bin-hadoop2.7\python\pyspark\rdd.py in take(self, num)
>>>    1341
>>>    1342             p = range(partsScanned, min(partsScanned + numPartsToTry, totalParts))
>>> -> 1343             res = self.context.runJob(self, takeUpToNumLeft, p)
>>>    1344
>>>    1345             items += res
>>>
>>> C:\spark-2.1.0-bin-hadoop2.7\spark-2.1.0-bin-hadoop2.7\python\pyspark\context.py in runJob(self, rdd, partitionFunc, partitions, allowLocal)
>>>     963         # SparkContext#runJob.
>>>     964         mappedRDD = rdd.mapPartitions(partitionFunc)
>>> --> 965         port = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, partitions)
>>>     966         return list(_load_from_socket(port, mappedRDD._jrdd_deserializer))
>>>     967
>>>
>>> C:\spark-2.1.0-bin-hadoop2.7\spark-2.1.0-bin-hadoop2.7\python\lib\py4j-0.10.4-src.zip\py4j\java_gateway.py in __call__(self, *args)
>>>    1131         answer = self.gateway_client.send_command(command)
>>>    1132         return_value = get_return_value(
>>> -> 1133             answer, self.gateway_client, self.target_id, self.name)
>>>    1134
>>>    1135         for temp_arg in temp_args:

Re: Optimisation Tips

2017-04-12 Thread Pushkar.Gujar
Not an expert, but the groupByKey operation is well known to cause a lot of
shuffling, and what groupByKey does can usually be done with reduceByKey
instead.

Here is a great article on the groupByKey operation -

https://github.com/awesome-spark/spark-gotchas/blob/master/04_rdd_actions_and_transformations_by_example.md#be-smart-about-groupbykey


This whole repository, "spark-gotchas", is filled with a lot of helpful tips.
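
For the archive, a tiny PySpark illustration of the difference, assuming an
existing SparkContext sc and made-up data:

# Both produce the same sums per key; groupByKey ships every value across the
# network before aggregating, while reduceByKey combines map-side first.
pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3), ("b", 4)])

sums_grouped = pairs.groupByKey().mapValues(sum)       # heavy shuffle
sums_reduced = pairs.reduceByKey(lambda x, y: x + y)   # map-side combine

print(sorted(sums_grouped.collect()))   # [('a', 4), ('b', 6)]
print(sorted(sums_reduced.collect()))   # [('a', 4), ('b', 6)]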




Thank you,
*Pushkar Gujar*


On Wed, Apr 12, 2017 at 10:48 AM, KhajaAsmath Mohammed <
mdkhajaasm...@gmail.com> wrote:

> Hi Steve,
>
> I have implemented repartitioning of the dataframe to 1. It helped the
> performance but not to a great extent. I am also looking for answers from
> the experts.
>
> Thanks,
> Asmath
>
> On Wed, Apr 12, 2017 at 9:45 AM, Steve Robinson <
> steve.robin...@aquilainsight.com> wrote:
>
>> Hi,
>>
>>
>> Does anyone have any optimisation tips or could propose an alternative
>> way to perform the below:
>>
>>
>> val groupedUserItems1 = userItems1.groupByKey{_.customer_id}
>> val groupedUserItems2 = userItems2.groupByKey{_.customer_id}
>> groupedUserItems1.cogroup(groupedUserItems2){
>>case (_, userItems1, userItems2) =>
>> processSingleUser(userItems1, userItems2)
>> }
>> }
>>
>> The userItems1 and userItems2 datasets are quite large (hundreds of millions
>> of records), so I'm finding the shuffle stage is shuffling gigabytes of data.
>>
>> Any help would be greatly appreciated.
>>
>> Thanks,
>>
>>
>> Steve Robinson
>>
>> steve.robin...@aquilainsight.com
>> 0131 290 2300
>>
>>
>> www.aquilainsight.com
>> linkedin.com/aquilainsight
>> 
>>
>> twitter.com/aquilainsight
>>
>
>


Re: How to run a spark on Pycharm

2017-03-03 Thread Pushkar.Gujar
There are a lot of articles available online which guide you through setting up
Jupyter notebooks to run Spark programs. For example:

http://blog.insightdatalabs.com/jupyter-on-apache-spark-step-by-step/
https://www.cloudera.com/documentation/enterprise/5-6-x/topics/spark_ipython.html
https://gist.github.com/tommycarpi/f5a67c66a8f2170e263c
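
One common shortcut (a sketch - it assumes the third-party findspark package
is installed and Spark is unpacked at the placeholder path below) works the
same from PyCharm or a Jupyter notebook:

import findspark
findspark.init("/opt/spark-2.1.0-bin-hadoop2.7")  # placeholder SPARK_HOME

from pyspark.sql import SparkSession

# A plain local session; run this from a PyCharm run configuration or a notebook cell.
spark = (SparkSession.builder
         .master("local[*]")
         .appName("pycharm-test")
         .getOrCreate())
print(spark.range(5).count())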




Thank you,
*Pushkar Gujar*


On Fri, Mar 3, 2017 at 10:05 AM, Anahita Talebi <anahita.t.am...@gmail.com>
wrote:

> Hi,
>
> Thanks for your answer.
>
> Sorry, I am a complete beginner at running code in Spark.
>
> Could you please tell me in a bit more detail how to do that?
> I installed IPython and Jupyter notebook on my local machine. But how can
> I run the code using them? Before, I tried to run the code with PyCharm
> but I failed.
>
> Thanks,
> Anahita
>
> On Fri, Mar 3, 2017 at 3:48 PM, Pushkar.Gujar <pushkarvgu...@gmail.com>
> wrote:
>
>> Jupyter notebook/IPython can be connected to Apache Spark.
>>
>>
>> Thank you,
>> *Pushkar Gujar*
>>
>>
>> On Fri, Mar 3, 2017 at 9:43 AM, Anahita Talebi <anahita.t.am...@gmail.com
>> > wrote:
>>
>>> Hi everyone,
>>>
>>> I am trying to run Spark code in PyCharm. I tried to give the path of
>>> Spark as an environment variable in the PyCharm configuration.
>>> Unfortunately, I get an error. Does anyone know how I can run Spark
>>> code in PyCharm?
>>> It doesn't necessarily have to be PyCharm; if you know any other software,
>>> it would be nice to tell me.
>>>
>>> Thanks a lot,
>>> Anahita
>>>
>>>
>>>
>>
>


Re: How to run a spark on Pycharm

2017-03-03 Thread Pushkar.Gujar
Jupyter notebook/IPython can be connected to Apache Spark.


Thank you,
*Pushkar Gujar*


On Fri, Mar 3, 2017 at 9:43 AM, Anahita Talebi 
wrote:

> Hi everyone,
>
> I am trying to run Spark code in PyCharm. I tried to give the path of
> Spark as an environment variable in the PyCharm configuration.
> Unfortunately, I get an error. Does anyone know how I can run Spark
> code in PyCharm?
> It doesn't necessarily have to be PyCharm; if you know any other software, it
> would be nice to tell me.
>
> Thanks a lot,
> Anahita
>
>
>