Re: Spark for client

2016-03-01 Thread Mohannad Ali
Jupyter (http://jupyter.org/) also supports Spark, and generally it's a
beast that allows you to do so much more.
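
(For reference, one way to get a SparkContext inside a Jupyter notebook is the
findspark helper. This is only a minimal sketch and assumes the findspark
package is installed and SPARK_HOME points at a local Spark installation; the
app name is made up.)

import findspark
findspark.init()  # puts the PySpark libraries under SPARK_HOME on sys.path

from pyspark import SparkContext
sc = SparkContext(appName="jupyter-example")
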
On Mar 1, 2016 00:25, "Mich Talebzadeh"  wrote:

> Thank you very much both
>
> Zeppelin looks promising. Basically, as I understand it, it runs an agent on a
> given port (I chose 21999) on the host where Spark is installed. I created a
> notebook and am running scripts through there. One thing for sure: the notebook
> just returns the results rather than all the other stuff that one does not need.
>
> Cheers,
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn:
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 29 February 2016 at 19:22, Minudika Malshan 
> wrote:
>
>> +Adding resources
>> https://zeppelin.incubator.apache.org/docs/latest/interpreter/spark.html
>> https://zeppelin.incubator.apache.org
>>
>> Minudika Malshan
>> Undergraduate
>> Department of Computer Science and Engineering
>> University of Moratuwa.
>> Mobile : +94715659887
>> LinkedIn : https://lk.linkedin.com/in/minudika
>>
>>
>>
>> On Tue, Mar 1, 2016 at 12:51 AM, Minudika Malshan 
>> wrote:
>>
>>> Hi,
>>>
>>> I think zeppelin spark interpreter will give a solution to your problem.
>>>
>>> Regards.
>>> Minudika
>>>
>>> Minudika Malshan
>>> Undergraduate
>>> Department of Computer Science and Engineering
>>> University of Moratuwa.
>>> Mobile : +94715659887
>>> LinkedIn : https://lk.linkedin.com/in/minudika
>>>
>>>
>>>
>>> On Tue, Mar 1, 2016 at 12:35 AM, Sabarish Sasidharan <
>>> sabarish.sasidha...@manthan.com> wrote:
>>>
 Zeppelin?

 Regards
 Sab
 On 01-Mar-2016 12:27 am, "Mich Talebzadeh" 
 wrote:

> Hi,
>
> Is there such a thing as Spark for the client, much like RDBMS clients that
> have a cut-down version of their big brother, useful for client connectivity
> but which cannot be used as a server?
>
> Thanks
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn:
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>

>>>
>>
>


Re: Support virtualenv in PySpark

2016-03-01 Thread Mohannad Ali
Hello Jeff,

Well, this would also mean that you have to manage the same virtualenv (same
path) on all nodes and install your packages into it the same way you would
if you were installing the packages to the default Python path.

In any case, at the moment you can already do what you proposed by creating
identical virtualenvs on all nodes at the same path and changing the Spark
Python path to point to the virtualenv.
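
(A minimal sketch of that workaround, assuming a virtualenv already exists at
the same path on every node; the path /opt/venvs/myapp and the app name are
made-up examples:)

import os
from pyspark import SparkConf, SparkContext

# The driver-side interpreter; PYSPARK_PYTHON must be set before the
# SparkContext is created.
os.environ["PYSPARK_PYTHON"] = "/opt/venvs/myapp/bin/python"

conf = (SparkConf()
        .setAppName("venv-example")
        # Point the executors at the same interpreter inside the virtualenv.
        .set("spark.executorEnv.PYSPARK_PYTHON", "/opt/venvs/myapp/bin/python"))
sc = SparkContext(conf=conf)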

Best Regards,
Mohannad
On Mar 1, 2016 06:07, "Jeff Zhang"  wrote:

> I have created a JIRA for this feature; comments and feedback are welcome
> about how to improve it and whether it's valuable for users.
>
> https://issues.apache.org/jira/browse/SPARK-13587
>
>
> Here's some background info and status of this work.
>
>
> Currently, it's not easy for users to add third-party Python packages in
> pyspark.
>
>    - One way is using --py-files (suitable for simple dependencies, but
>    not suitable for complicated dependencies, especially with transitive
>    dependencies)
>    - Another way is to install packages manually on each node (time consuming,
>    and not easy to switch to a different environment)
>
> Python now has 2 different virtualenv implementations: one is the native
> virtualenv, the other is through conda.
>
> I have implemented a POC for this feature. Here's one simple command showing
> how to use virtualenv in pyspark:
>
> bin/spark-submit --master yarn --deploy-mode client --conf 
> "spark.pyspark.virtualenv.enabled=true" --conf 
> "spark.pyspark.virtualenv.type=conda" --conf 
> "spark.pyspark.virtualenv.requirements=/Users/jzhang/work/virtualenv/conda.txt"
>  --conf "spark.pyspark.virtualenv.path=/Users/jzhang/anaconda/bin/conda"  
> ~/work/virtualenv/spark.py
>
> There are 4 properties that need to be set:
>
>- spark.pyspark.virtualenv.enabled (enable virtualenv)
>- spark.pyspark.virtualenv.type (native/conda are supported, default
>is native)
>- spark.pyspark.virtualenv.requirements (requirement file for the
>dependencies)
>    - spark.pyspark.virtualenv.path (path to the executable for
>    virtualenv/conda)
>
>
>
>
>
>
> Best Regards
>
> Jeff Zhang
>


Re: Using functional programming rather than SQL

2016-02-24 Thread Mohannad Ali
My apologies, I definitely misunderstood. You are 100% correct.
On Feb 24, 2016 19:25, "Sabarish Sasidharan" <
sabarish.sasidha...@manthan.com> wrote:

> I never said it needs one. All I said is that when calling context.sql(),
> the SQL is executed in the source database (assuming the data source is Hive
> or some RDBMS).
>
> Regards
> Sab
>
> On 24-Feb-2016 11:49 pm, "Mohannad Ali" <man...@gmail.com> wrote:
>
>> That is incorrect. HiveContext does not need a Hive instance to run.
>> On Feb 24, 2016 19:15, "Sabarish Sasidharan" <
>> sabarish.sasidha...@manthan.com> wrote:
>>
>>> Yes
>>>
>>> Regards
>>> Sab
>>> On 24-Feb-2016 9:15 pm, "Koert Kuipers" <ko...@tresata.com> wrote:
>>>
>>>> are you saying that HiveContext.sql(...) runs on hive, and not on
>>>> spark sql?
>>>>
>>>> On Wed, Feb 24, 2016 at 1:27 AM, Sabarish Sasidharan <
>>>> sabarish.sasidha...@manthan.com> wrote:
>>>>
>>>>> When using SQL, your full query, including the joins, was executed in
>>>>> Hive (or the RDBMS) and only the results were brought into the Spark
>>>>> cluster. In the FP case, the data for the 3 tables is first pulled into
>>>>> the Spark cluster and then the join is executed.
>>>>>
>>>>> Thus the time difference.
>>>>>
>>>>> It's not immediately obvious why the results are different.
>>>>>
>>>>> Regards
>>>>> Sab
>>>>> On 24-Feb-2016 5:40 am, "Mich Talebzadeh" <
>>>>> mich.talebza...@cloudtechnologypartners.co.uk> wrote:
>>>>>
>>>>>>
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> First thanks everyone for their suggestions. Much appreciated.
>>>>>>
>>>>>> These were the original queries, written in SQL and run against
>>>>>> spark-shell:
>>>>>>
>>>>>> val HiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
>>>>>> println ("\nStarted at"); HiveContext.sql("SELECT
>>>>>> FROM_unixtime(unix_timestamp(), 'dd/MM/yyyy HH:mm:ss.ss')
>>>>>> ").collect.foreach(println)
>>>>>> HiveContext.sql("use oraclehadoop")
>>>>>>
>>>>>> val rs = HiveContext.sql(
>>>>>> """
>>>>>> SELECT t.calendar_month_desc, c.channel_desc, SUM(s.amount_sold) AS
>>>>>> TotalSales
>>>>>> FROM smallsales s
>>>>>> INNER JOIN times t
>>>>>> ON s.time_id = t.time_id
>>>>>> INNER JOIN channels c
>>>>>> ON s.channel_id = c.channel_id
>>>>>> GROUP BY t.calendar_month_desc, c.channel_desc
>>>>>> """)
>>>>>> rs.registerTempTable("tmp")
>>>>>> println ("\nfirst query")
>>>>>> HiveContext.sql("""
>>>>>> SELECT calendar_month_desc AS MONTH, channel_desc AS CHANNEL,
>>>>>> TotalSales
>>>>>> from tmp
>>>>>> ORDER BY MONTH, CHANNEL LIMIT 5
>>>>>> """).collect.foreach(println)
>>>>>> println ("\nsecond query")
>>>>>> HiveContext.sql("""
>>>>>> SELECT channel_desc AS CHANNEL, MAX(TotalSales)  AS SALES
>>>>>> FROM tmp
>>>>>> GROUP BY channel_desc
>>>>>> order by SALES DESC LIMIT 5
>>>>>> """).collect.foreach(println)
>>>>>> println ("\nFinished at"); HiveContext.sql("SELECT
>>>>>> FROM_unixtime(unix_timestamp(), 'dd/MM/yyyy HH:mm:ss.ss')
>>>>>> ").collect.foreach(println)
>>>>>> sys.exit
>>>>>>
>>>>>> The second set of queries was written in FP as much as I could, as below:
>>>>>>
>>>>>> val HiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
>>>>>> println ("\nStarted at"); HiveContext.sql("SELECT
>>>>>> FROM_unixtime(unix_timestamp(), 'dd/MM/yyyy HH:mm:ss.ss')
>>>>>> ").collect.foreach(println)
>>>>>> HiveContext.sql("use oraclehadoop")

Re: Using functional programming rather than SQL

2016-02-24 Thread Mohannad Ali
That is incorrect. HiveContext does not need a Hive instance to run.
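
(For context, a minimal PySpark sketch showing that a HiveContext can be created
without any external Hive service; Spark falls back to a local, embedded
metastore by default:)

from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName="hivecontext-example")
# No Hive installation required: Spark creates a Derby-backed metastore_db
# in the current working directory by default.
sqlContext = HiveContext(sc)
sqlContext.sql("SHOW TABLES").show()
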
On Feb 24, 2016 19:15, "Sabarish Sasidharan" <
sabarish.sasidha...@manthan.com> wrote:

> Yes
>
> Regards
> Sab
> On 24-Feb-2016 9:15 pm, "Koert Kuipers"  wrote:
>
>> are you saying that HiveContext.sql(...) runs on hive, and not on spark
>> sql?
>>
>> On Wed, Feb 24, 2016 at 1:27 AM, Sabarish Sasidharan <
>> sabarish.sasidha...@manthan.com> wrote:
>>
>>> When using SQL, your full query, including the joins, was executed in
>>> Hive (or the RDBMS) and only the results were brought into the Spark cluster. In
>>> the FP case, the data for the 3 tables is first pulled into the Spark
>>> cluster and then the join is executed.
>>>
>>> Thus the time difference.
>>>
>>> It's not immediately obvious why the results are different.
>>>
>>> Regards
>>> Sab
>>> On 24-Feb-2016 5:40 am, "Mich Talebzadeh" <
>>> mich.talebza...@cloudtechnologypartners.co.uk> wrote:
>>>


 Hi,

 First thanks everyone for their suggestions. Much appreciated.

 These were the original queries, written in SQL and run against spark-shell:

 val HiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
 println ("\nStarted at"); HiveContext.sql("SELECT
 FROM_unixtime(unix_timestamp(), 'dd/MM/yyyy HH:mm:ss.ss')
 ").collect.foreach(println)
 HiveContext.sql("use oraclehadoop")

 val rs = HiveContext.sql(
 """
 SELECT t.calendar_month_desc, c.channel_desc, SUM(s.amount_sold) AS
 TotalSales
 FROM smallsales s
 INNER JOIN times t
 ON s.time_id = t.time_id
 INNER JOIN channels c
 ON s.channel_id = c.channel_id
 GROUP BY t.calendar_month_desc, c.channel_desc
 """)
 rs.registerTempTable("tmp")
 println ("\nfirst query")
 HiveContext.sql("""
 SELECT calendar_month_desc AS MONTH, channel_desc AS CHANNEL, TotalSales
 from tmp
 ORDER BY MONTH, CHANNEL LIMIT 5
 """).collect.foreach(println)
 println ("\nsecond query")
 HiveContext.sql("""
 SELECT channel_desc AS CHANNEL, MAX(TotalSales)  AS SALES
 FROM tmp
 GROUP BY channel_desc
 order by SALES DESC LIMIT 5
 """).collect.foreach(println)
 println ("\nFinished at"); HiveContext.sql("SELECT
 FROM_unixtime(unix_timestamp(), 'dd/MM/yyyy HH:mm:ss.ss')
 ").collect.foreach(println)
 sys.exit

 The second set of queries was written in FP as much as I could, as below:

 val HiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
 println ("\nStarted at"); HiveContext.sql("SELECT
 FROM_unixtime(unix_timestamp(), 'dd/MM/yyyy HH:mm:ss.ss')
 ").collect.foreach(println)
 HiveContext.sql("use oraclehadoop")
 var s = HiveContext.sql("SELECT AMOUNT_SOLD, TIME_ID, CHANNEL_ID FROM
 sales")
 val c = HiveContext.sql("SELECT CHANNEL_ID, CHANNEL_DESC FROM channels")
 val t = HiveContext.sql("SELECT TIME_ID, CALENDAR_MONTH_DESC FROM
 times")
 val rs =
 s.join(t,"time_id").join(c,"channel_id").groupBy("calendar_month_desc","channel_desc").agg(sum("amount_sold").as("TotalSales"))
 println ("\nfirst query")
 val rs1 =
 rs.orderBy("calendar_month_desc","channel_desc").take(5).foreach(println)
 println ("\nsecond query")
 val rs2
 =rs.groupBy("channel_desc").agg(max("TotalSales").as("SALES")).orderBy("SALES").sort(desc("SALES")).take(5).foreach(println)
 println ("\nFinished at"); HiveContext.sql("SELECT
 FROM_unixtime(unix_timestamp(), 'dd/MM/yyyy HH:mm:ss.ss')
 ").collect.foreach(println)
 sys.exit



 However, the first query results are slightly different in SQL and FP
 (maybe the first query code in FP is not exactly correct?) and, more
 importantly, the FP version takes an order of magnitude longer than SQL (8
 minutes compared to less than a minute). I am not surprised, as I expected that
 functional programming has to flatten up all those method calls and convert
 them to SQL?

 The standard SQL results:



 Started at
 [23/02/2016 23:55:30.30]
 res1: org.apache.spark.sql.DataFrame = [result: string]
 rs: org.apache.spark.sql.DataFrame = [calendar_month_desc: string,
 channel_desc: string, TotalSales: decimal(20,0)]

 first query
 [1998-01,Direct Sales,9161730]
 [1998-01,Internet,1248581]
 [1998-01,Partners,2409776]
 [1998-02,Direct Sales,9161840]
 [1998-02,Internet,1533193]



 second query
 [Direct Sales,9161840]
 [Internet,3977374]
 [Partners,3976291]
 [Tele Sales,328760]

 Finished at
 [23/02/2016 23:56:11.11]

 The FP results:

 Started at
 [23/02/2016 23:45:58.58]
 res1: org.apache.spark.sql.DataFrame = [result: string]
 s: org.apache.spark.sql.DataFrame = [AMOUNT_SOLD: decimal(10,0),
 TIME_ID: timestamp, CHANNEL_ID: bigint]
 c: org.apache.spark.sql.DataFrame = [CHANNEL_ID: double, CHANNEL_DESC:
 string]

Re: Spark Job Hanging on Join

2016-02-23 Thread Mohannad Ali
Hello Everyone,

Thanks a lot for the help. We also managed to solve it, but without
resorting to Spark 1.6.

The problem we were having was because of a really bad join condition:

ON ((a.col1 = b.col1) or (a.col1 is null and b.col1 is null)) AND ((a.col2
= b.col2) or (a.col2 is null and b.col2 is null))

So what we did was rework our logic to remove the null checks in the join
condition, and the join went lightning fast afterwards :)
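
(For illustration, one common way to avoid null-equality checks in a join is to
replace NULLs with a sentinel value that never occurs in the real data and then
join on plain equality. A rough PySpark sketch with made-up dataframe and column
names, assuming col1 and col2 are string columns:)

SENTINEL = "__MISSING__"

a_filled = df_a.fillna({"col1": SENTINEL, "col2": SENTINEL})
b_filled = df_b.fillna({"col1": SENTINEL, "col2": SENTINEL})

# Plain equi-join instead of the OR / IS NULL condition above.
joined = a_filled.join(
    b_filled,
    (a_filled["col1"] == b_filled["col1"]) & (a_filled["col2"] == b_filled["col2"]),
    "left_outer")
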
On Feb 22, 2016 21:24, "Dave Moyers" <davemoy...@icloud.com> wrote:

> Good article! Thanks for sharing!
>
>
> > On Feb 22, 2016, at 11:10 AM, Davies Liu <dav...@databricks.com> wrote:
> >
> > This link may help:
> >
> https://forums.databricks.com/questions/6747/how-do-i-get-a-cartesian-product-of-a-huge-dataset.html
> >
> > Spark 1.6 has improved the CartesianProduct; you should turn off auto
> > broadcast and go with CartesianProduct in 1.6.
> >
> > On Mon, Feb 22, 2016 at 1:45 AM, Mohannad Ali <man...@gmail.com> wrote:
> >> Hello everyone,
> >>
> >> I'm working with Tamara and I wanted to give you guys an update on the
> >> issue:
> >>
> >> 1. Here is the output of .explain():
> >>>
> >>> Project
> >>>
> [sk_customer#0L,customer_id#1L,country#2,email#3,birthdate#4,gender#5,fk_created_at_date#6,age_range#7,first_name#8,last_name#9,inserted_at#10L,updated_at#11L,customer_id#25L
> >>> AS new_customer_id#38L,country#24 AS new_country#39,email#26 AS
> >>> new_email#40,birthdate#29 AS new_birthdate#41,gender#31 AS
> >>> new_gender#42,fk_created_at_date#32 AS
> >>> new_fk_created_at_date#43,age_range#30 AS
> new_age_range#44,first_name#27 AS
> >>> new_first_name#45,last_name#28 AS new_last_name#46]
> >>> BroadcastNestedLoopJoin BuildLeft, LeftOuter, Somecustomer_id#1L =
> >>> customer_id#25L) || (isnull(customer_id#1L) &&
> isnull(customer_id#25L))) &&
> >>> ((country#2 = country#24) || (isnull(country#2) &&
> isnull(country#24)
> >>>  Scan
> >>>
> PhysicalRDD[country#24,customer_id#25L,email#26,first_name#27,last_name#28,birthdate#29,age_range#30,gender#31,fk_created_at_date#32]
> >>>  Scan
> >>>
> ParquetRelation[hdfs:///databases/dimensions/customer_dimension][sk_customer#0L,customer_id#1L,country#2,email#3,birthdate#4,gender#5,fk_created_at_date#6,age_range#7,first_name#8,last_name#9,inserted_at#10L,updated_at#11L]
> >>
> >>
> >> 2. Setting spark.sql.autoBroadcastJoinThreshold=-1 didn't make a
> >> difference. It still hangs indefinitely.
> >> 3. We are using Spark 1.5.2
> >> 4. We tried running this with 4 executors, 9 executors, and even in local
> >> mode with master set to "local[4]". The issue still persists in all cases.
> >> 5. Even without trying to cache any of the dataframes, this issue still
> >> happens.
> >> 6. We have about 200 partitions.
> >>
> >> Any help would be appreciated!
> >>
> >> Best Regards,
> >> Mo
> >>
> >> On Sun, Feb 21, 2016 at 8:39 PM, Gourav Sengupta <
> gourav.sengu...@gmail.com>
> >> wrote:
> >>>
> >>> Sorry,
> >>>
> >>> please include the following questions to the list above:
> >>>
> >>> the SPARK version?
> >>> whether you are using RDD or DataFrames?
> >>> is the code run locally or in SPARK Cluster mode or in AWS EMR?
> >>>
> >>>
> >>> Regards,
> >>> Gourav Sengupta
> >>>
> >>> On Sun, Feb 21, 2016 at 7:37 PM, Gourav Sengupta
> >>> <gourav.sengu...@gmail.com> wrote:
> >>>>
> >>>> Hi Tamara,
> >>>>
> >>>> few basic questions first.
> >>>>
> >>>> How many executors are you using?
> >>>> Is the data getting all cached into the same executor?
> >>>> How many partitions do you have of the data?
> >>>> How many fields are you trying to use in the join?
> >>>>
> >>>> If you need any help in finding answers to these questions please let me
> >>>> know. From what I reckon, joins like yours should not take more than a few
> >>>> milliseconds.
> >>>>
> >>>>
> >>>> Regards,
> >>>> Gourav Sengupta
> >>>>
> >>>> On Fri, Feb 19, 2016 at 5:31 PM, Tamara Mendt <t...@hellofresh.com>
> wrote:

Re: spark.driver.maxResultSize doesn't work in conf-file

2016-02-22 Thread Mohannad Ali
In spark-defaults.conf you put the values like "spark.driver.maxResultSize 0"
(whitespace-separated) instead of "spark.driver.maxResultSize=0", I think.

On Sat, Feb 20, 2016 at 3:40 PM, AlexModestov 
wrote:

> I have a string spark.driver.maxResultSize=0 in the spark-defaults.conf.
> But I get an error:
>
> "org.apache.spark.SparkException: Job aborted due to stage failure: Total
> size of serialized results of 18 tasks (1070.5 MB) is bigger than
> spark.driver.maxResultSize (1024.0 MB)"
>
> But if I write --conf spark.driver.maxResultSize=0 in pyspark-shell it
> works
> fine.
>
> Could anyone know how to fix it?
> Thank you
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/spark-driver-maxResultSize-doesn-t-work-in-conf-file-tp26279.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Spark Job Hanging on Join

2016-02-22 Thread Mohannad Ali
Hello everyone,

I'm working with Tamara and I wanted to give you guys an update on the
issue:

1. Here is the output of .explain():

> Project
> [sk_customer#0L,customer_id#1L,country#2,email#3,birthdate#4,gender#5,fk_created_at_date#6,age_range#7,first_name#8,last_name#9,inserted_at#10L,updated_at#11L,customer_id#25L
> AS new_customer_id#38L,country#24 AS new_country#39,email#26 AS
> new_email#40,birthdate#29 AS new_birthdate#41,gender#31 AS
> new_gender#42,fk_created_at_date#32 AS
> new_fk_created_at_date#43,age_range#30 AS new_age_range#44,first_name#27 AS
> new_first_name#45,last_name#28 AS new_last_name#46]
>  BroadcastNestedLoopJoin BuildLeft, LeftOuter, Somecustomer_id#1L =
> customer_id#25L) || (isnull(customer_id#1L) && isnull(customer_id#25L))) &&
> ((country#2 = country#24) || (isnull(country#2) && isnull(country#24)
>   Scan
> PhysicalRDD[country#24,customer_id#25L,email#26,first_name#27,last_name#28,birthdate#29,age_range#30,gender#31,fk_created_at_date#32]
>   Scan
> ParquetRelation[hdfs:///databases/dimensions/customer_dimension][sk_customer#0L,customer_id#1L,country#2,email#3,birthdate#4,gender#5,fk_created_at_date#6,age_range#7,first_name#8,last_name#9,inserted_at#10L,updated_at#11L]


2. Setting spark.sql.autoBroadcastJoinThreshold=-1 didn't make a
difference. It still hangs indefinitely.
3. We are using Spark 1.5.2
4. We tried running this with 4 executors, 9 executors, and even in local
mode with master set to "local[4]". The issue still persists in all cases.
5. Even without trying to cache any of the dataframes, this issue still
happens.
6. We have about 200 partitions.

Any help would be appreciated!

Best Regards,
Mo

On Sun, Feb 21, 2016 at 8:39 PM, Gourav Sengupta 
wrote:

> Sorry,
>
> please include the following questions to the list above:
>
> the SPARK version?
> whether you are using RDD or DataFrames?
> is the code run locally or in SPARK Cluster mode or in AWS EMR?
>
>
> Regards,
> Gourav Sengupta
>
> On Sun, Feb 21, 2016 at 7:37 PM, Gourav Sengupta <
> gourav.sengu...@gmail.com> wrote:
>
>> Hi Tamara,
>>
>> few basic questions first.
>>
>> How many executors are you using?
>> Is the data getting all cached into the same executor?
>> How many partitions do you have of the data?
>> How many fields are you trying to use in the join?
>>
>> If you need any help in finding answers to these questions please let me
>> know. From what I reckon, joins like yours should not take more than a few
>> milliseconds.
>>
>>
>> Regards,
>> Gourav Sengupta
>>
>> On Fri, Feb 19, 2016 at 5:31 PM, Tamara Mendt  wrote:
>>
>>> Hi all,
>>>
>>> I am running a Spark job that gets stuck attempting to join two
>>> dataframes. The dataframes are not very large, one is about 2 M rows, and
>>> the other a couple of thousand rows and the resulting joined dataframe
>>> should be about the same size as the smaller dataframe. I have tried
>>> triggering execution of the join using the 'first' operator, which as far
>>> as I understand would not require processing the entire resulting dataframe
>>> (maybe I am mistaken though). The Spark UI is not telling me anything, just
>>> showing the task to be stuck.
>>>
>>> When I run the exact same job on a slightly smaller dataset it works
>>> without hanging.
>>>
>>> I have used the same environment to run joins on much larger dataframes,
>>> so I am confused as to why in this particular case my Spark job is just
>>> hanging. I have also tried running the same join operation using pyspark on
>>> two 2 Million row dataframes (exactly like the one I am trying to join in
>>> the job that gets stuck) and it runs successfully.
>>>
>>> I have tried caching the joined dataframe to see how much memory it is
>>> requiring but the job gets stuck on this action too. I have also tried
>>> using persist to memory and disk on the join, and the job seems to be stuck
>>> all the same.
>>>
>>> Any help as to where to look for the source of the problem would be much
>>> appreciated.
>>>
>>> Cheers,
>>>
>>> Tamara
>>>
>>>
>>
>


Re: [Example] : read custom schema from file

2016-02-22 Thread Mohannad Ali
Hello Divya,

What kind of file?

Best Regards,
Mohannad

On Mon, Feb 22, 2016 at 8:40 AM, Divya Gehlot 
wrote:

> Hi,
> Can anybody help me by providing an example of how we can read the schema of
> a data set from a file?
>
>
>
> Thanks,
> Divya
>


Re: Submit custom python packages from current project

2016-02-16 Thread Mohannad Ali
Hello Ramanathan,

Unfortunately I tried this already and it doesn't work.

Mo

On Tue, Feb 16, 2016 at 2:13 PM, Ramanathan R <ramanatha...@gmail.com>
wrote:

> Have you tried setting PYTHONPATH?
> $ export PYTHONPATH="/path/to/project"
> $ spark-submit --master yarn-client /path/to/project/main_script.py
>
> Regards,
> Ram
>
>
> On 16 February 2016 at 15:33, Mohannad Ali <man...@gmail.com> wrote:
>
>> Hello Everyone,
>>
>> I have code inside my project organized in packages and modules; however,
>> I keep getting the error "ImportError: No module named "
>> when I run Spark on YARN.
>>
>> My directory structure is something like this:
>>
>> project/
>>  package/
>>  module.py
>>  __init__.py
>>  bin/
>>  docs/
>>  setup.py
>>  main_script.py
>>  requirements.txt
>>  tests/
>>   package/
>>module_test.py
>>__init__.py
>>  __init__.py
>>
>>
>> So when I pass `main_script.py` to spark-submit with master set to
>> "yarn-client", the packages aren't found and I get the error above.
>>
>> With a code structure like this, adding everything as a py-file to the Spark
>> context seems counter-intuitive.
>>
>> I just want to organize my code as much as possible to make it more
>> readable and maintainable. Is there a better way to achieve good code
>> organization without running into such problems?
>>
>> Best Regards,
>> Mo
>>
>
>


Re: reading spark dataframe in python

2016-02-16 Thread Mohannad Ali
I think you need to consider using something like this:
http://sparklingpandas.com/
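
(For reference, a plain PySpark round trip is also possible when the data fits
in driver memory. A minimal sketch, assuming an existing SQLContext named
sqlContext and a Spark dataframe named spark_df:)

# Spark DataFrame -> pandas: collects all rows to the driver.
pandas_df = spark_df.toPandas()

# ... do the analysis in pandas ...

# pandas -> Spark DataFrame again.
spark_df2 = sqlContext.createDataFrame(pandas_df)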

On Tue, Feb 16, 2016 at 10:59 AM, Devesh Raj Singh 
wrote:

> Hi,
>
> I want to read a Spark dataframe using Python, convert the Spark dataframe to
> a pandas dataframe, and then convert the pandas dataframe back to a Spark
> dataframe (after doing some data analysis). Please suggest.
>
> --
> Warm regards,
> Devesh.
>


Submit custom python packages from current project

2016-02-16 Thread Mohannad Ali
Hello Everyone,

I have code inside my project organized in packages and modules; however, I
keep getting the error "ImportError: No module named " when
I run Spark on YARN.

My directory structure is something like this:

project/
 package/
 module.py
 __init__.py
 bin/
 docs/
 setup.py
 main_script.py
 requirements.txt
 tests/
  package/
   module_test.py
   __init__.py
 __init__.py


So when I pass `main_script.py` to spark-submit with master set to
"yarn-client", the packages aren't found and I get the error above.

With a code structure like this, adding everything as a py-file to the Spark
context seems counter-intuitive.
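
(For reference, that workaround looks roughly like the sketch below: build a zip
of the package directory, e.g. zip -r package.zip package/, and ship it with the
job, either via --py-files on spark-submit or via addPyFile. The zip name is
just an example.)

from pyspark import SparkContext

sc = SparkContext(appName="packaged-app")
# Ships the zip to the executors and also adds it to the driver's sys.path.
sc.addPyFile("package.zip")

from package import module  # import after addPyFile so the zip is importable
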

I just want to organize my code as much as possible to make it more
readable and maintainable. Is there a better way to achieve good code
organization without running into such problems?

Best Regards,
Mo