Re: testing frameworks

2019-02-04 Thread Marco Mistroni
Thanks Hichame, will follow up on that.

Anyone on this list using the Python version of spark-testing-base? It seems
there's support for DataFrames.
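
For context, the kind of DataFrame test I'd like to write looks roughly like
this (a minimal pytest + PySpark sketch, not spark-testing-base itself; all
names here are illustrative):

~~~
# Minimal pytest + PySpark sketch of a DataFrame test; illustrative only.
import pytest
from pyspark.sql import SparkSession

@pytest.fixture(scope="session")
def spark():
    # Local SparkSession so the test runs without a cluster.
    session = (SparkSession.builder
               .master("local[2]")
               .appName("df-tests")
               .getOrCreate())
    yield session
    session.stop()

def test_filter_keeps_adults(spark):
    df = spark.createDataFrame([("a", 17), ("b", 21)], ["name", "age"])
    result = df.filter(df.age >= 18).collect()
    assert [r.name for r in result] == ["b"]
~~~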

thanks in advance and regards
 Marco

On Sun, Feb 3, 2019 at 9:58 PM Hichame El Khalfi wrote:

> Hi,
> You can use pysparkling => https://github.com/svenkreiss/pysparkling
> This lib is useful in case you have RDDs.
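>
> A minimal sketch of the idea (assuming pysparkling's Context mirrors the
> core RDD API, as its README describes):
>
> ~~~
> # Pure-Python RDD-style pipeline; no JVM or cluster needed.
> from pysparkling import Context
>
> sc = Context()
> rdd = sc.parallelize([1, 2, 3, 4])
> assert rdd.map(lambda x: x * 2).collect() == [2, 4, 6, 8]
> ~~~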
>
> Hope this helps,
>
> Hichame
>
> *From:* mmistr...@gmail.com
> *Sent:* February 3, 2019 4:42 PM
> *To:* radams...@gmail.com
> *Cc:* la...@mapflat.com; bpru...@opentext.com; user@spark.apache.org
> *Subject:* Re: testing frameworks
>
> Hi
>  sorry to resurrect this thread
> Any Spark libraries for testing code in PySpark? The GitHub code above
> seems related to Scala.
> Following links in the original threads (and also LMGFY), I found
> pytest-spark on PyPI.
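>
> If I read its README right, it ships ready-made pytest fixtures, so a test
> would look roughly like this (a sketch assuming the spark_session fixture
> it advertises; the rest is illustrative):
>
> ~~~
> # Sketch of a test relying on pytest-spark's spark_session fixture.
> def test_count(spark_session):
>     df = spark_session.createDataFrame([(1,), (2,)], ["x"])
>     assert df.count() == 2
> ~~~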
>
> w/kindest regards
>  Marco
>
>
>
>
> On Tue, Jun 12, 2018 at 6:44 PM Ryan Adams  wrote:
>
>> We use spark-testing-base for unit testing.  These tests execute on a
>> very small amount of data that covers all paths the code can take (or most
>> paths anyway).
>>
>> https://github.com/holdenk/spark-testing-base
>>
>> For integration testing we use automated routines to ensure that
>> aggregate values match an aggregate baseline.
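>>
>> Roughly, the idea is the following (a PySpark sketch; the column name,
>> baseline value, and tolerance are illustrative, not our actual routine):
>>
>> ~~~
>> # Compare a job's aggregate output against a known baseline value.
>> def check_against_baseline(df, baseline_total, tol=1e-6):
>>     actual = df.agg({"amount": "sum"}).collect()[0][0]
>>     assert abs(actual - baseline_total) <= tol * abs(baseline_total)
>> ~~~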
>>
>> Ryan
>>
>> Ryan Adams
>> radams...@gmail.com
>>
>> On Tue, Jun 12, 2018 at 11:51 AM, Lars Albertsson wrote:
>>
>>> Hi,
>>>
>>> I wrote this answer to the same question a couple of years ago:
>>> https://www.mail-archive.com/user%40spark.apache.org/msg48032.html
>>>
>>> I have made a couple of presentations on the subject. Slides and video
>>> are linked on this page: http://www.mapflat.com/presentations/
>>>
>>> You can find more material in this list of resources:
>>> http://www.mapflat.com/lands/resources/reading-list
>>>
>>> Happy testing!
>>>
>>> Regards,
>>>
>>>
>>>
>>> Lars Albertsson
>>> Data engineering consultant
>>> www.mapflat.com
>>> https://twitter.com/lalleal
>>> +46 70 7687109
>>> Calendar: http://www.mapflat.com/calendar
>>>
>>>
>>> On Mon, May 21, 2018 at 2:24 PM, Steve Pruitt wrote:
>>> > Hi,
>>> >
>>> >
>>> >
>>> > Can anyone recommend testing frameworks suitable for Spark jobs?
>>> > Something that can be integrated into a CI tool would be great.
>>> >
>>> >
>>> >
>>> > Thanks.
>>> >
>>> >
>>>
>>>
>>>
>>


Re: SPIP: DataFrame-based Property Graphs, Cypher Queries, and Algorithms

2019-02-04 Thread Simon Hewitt
Currently working on a large dataset that would be much more easily modelled
using Cypher. Very interested in this proposal.

On 2019/01/15 16:52:44, Xiangrui Meng  wrote: 
> Hi all,
>
> I want to re-send the previous SPIP on introducing a DataFrame-based graph
> component to collect more feedback. It supports property graphs, Cypher
> graph queries, and graph algorithms built on top of the DataFrame API. If
> you are a GraphX user or your workload is essentially graph queries, please
> help review and check how it fits into your use cases. Your feedback would
> be greatly appreciated!
>
> # Links to SPIP and design sketch:
>
> * Jira issue for the SPIP: https://issues.apache.org/jira/browse/SPARK-25994
> * Google Doc:
> https://docs.google.com/document/d/1ljqVsAh2wxTZS8XqwDQgRT6i_mania3ffYSYpEgLx9k/edit?usp=sharing
> * Jira issue for a first design sketch:
> https://issues.apache.org/jira/browse/SPARK-26028
> * Google Doc:
> https://docs.google.com/document/d/1Wxzghj0PvpOVu7XD1iA8uonRYhexwn18utdcTxtkxlI/edit?usp=sharing
>
> # Sample code:
>
> ~~~
> val graph = ...
>
> // query
> val result = graph.cypher("""
>   MATCH (p:Person)-[r:STUDY_AT]->(u:University)
>   RETURN p.name, r.since, u.name
> """)
>
> // algorithms
> val ranks = graph.pageRank.run()
> ~~~
>
> Best,
> Xiangrui
>




Re: Unsubscribe

2019-02-04 Thread Sunil Prabhakara
Unsubscribe


Can not start thrift-server on spark2.4

2019-02-04 Thread Moein Hosseini
I'd like to start the Spark Thrift Server on a cluster of 3 machines with
HDFS and standalone HA Spark (v2.4).
So I started it with the following command under user spark24, but I get a
runtime exception about HDFS permissions.
Command:
./start-thriftserver.sh --master spark://master:7077
Exception:
Caused by: java.lang.RuntimeException: java.lang.RuntimeException: The
root scratch dir: /tmp/hive on HDFS should be writable. Current permissions
are: rwxr-xr-x

But I think the HDFS permissions are fine, because I chmodded everything
under /tmp/hive to 777 and set rwx access for the spark24 and hive users
and all groups.

$ hdfs dfs -ls -d /tmp/hive
drwxrwxrwx+  - hive hdfs  0 2019-02-03 14:13 /tmp/hive

$ hdfs dfs -getfacl /tmp/hive
# file: /tmp/hive
# owner: hive
# group: hdfs
user::rwx
user:hive:rwx
user:spark24:rwx
group::rwx
group:hive:rwx
group:spark24:rwx
mask::rwx
other::rwx

What is wrong in my case?
-- 

Moein Hosseini
Data Engineer
mobile: +98 912 468 1859
site: www.moein.xyz
email: moein...@gmail.com


Re: Avoiding collect but use foreach

2019-02-04 Thread 刘虓
Hi,
I think you can make your Python code into a UDF and call the UDF in
foreachPartition.
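
A minimal sketch of the idea (process_rows is a placeholder for your own
Python logic; the other names are illustrative):

~~~
# Run the per-row Python work on the executors with foreachPartition
# instead of collecting everything to the driver.
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[2]").getOrCreate()
df = spark.createDataFrame([(5, 1, 1), (3, 1, 2)], ["a", "b", "c"])

def process_rows(rows):
    for row in rows:
        values = list(row)  # same shape as the list-of-lists below
        print(values)       # stand-in for writing to a sink, etc.

df.foreachPartition(process_rows)
~~~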

Aakash Basu  wrote on Fri, Feb 1, 2019 at 3:37 PM:

> Hi,
>
> This:
>
>
> to_list = [list(row) for row in df.collect()]
>
>
> Gives:
>
>
> [[5, 1, 1, 1, 2, 1, 3, 1, 1, 0], [5, 4, 4, 5, 7, 10, 3, 2, 1, 0], [3, 1,
> 1, 1, 2, 2, 3, 1, 1, 0], [6, 8, 8, 1, 3, 4, 3, 7, 1, 0], [4, 1, 1, 3, 2, 1,
> 3, 1, 1, 0]]
>
>
> I want to avoid the collect operation, but still convert the DataFrame to a
> Python list of lists, just as above, for downstream operations.
>
>
> Is there a way I can do it, maybe with better-performing code than using
> collect?
>
>
> Thanks,
>
> Aakash.
>