Re: [ANNOUNCE] Announcing Apache Spark 3.1.1

2021-03-05 Thread Felix Cheung
Congrats and thanks! From: Hyukjin Kwon Sent: Wednesday, March 3, 2021 4:09:23 PM To: Dongjoon Hyun Cc: Gabor Somogyi ; Jungtaek Lim ; angers zhu ; Wenchen Fan ; Kent Yao ; Takeshi Yamamuro ; dev ; user @spark Subject: Re: [ANNOUNCE] Announcing Apache Spark

Fwd: Announcing ApacheCon @Home 2020

2020-07-01 Thread Felix Cheung
-- Forwarded message - We are pleased to announce that ApacheCon @Home will be held online, September 29 through October 1. More event details are available at https://apachecon.com/acah2020 but there’s a few things that I want to highlight for you, the members. Yes, the CFP

Re: [ANNOUNCE] Apache Spark 3.0.0

2020-06-18 Thread Felix Cheung
Congrats From: Jungtaek Lim Sent: Thursday, June 18, 2020 8:18:54 PM To: Hyukjin Kwon Cc: Mridul Muralidharan ; Reynold Xin ; dev ; user Subject: Re: [ANNOUNCE] Apache Spark 3.0.0 Great, thanks all for your efforts on the huge step forward! On Fri, Jun 19,

Re: Fail to use SparkR of 3.0 preview 2

2019-12-26 Thread Felix Cheung
Maybe it’s the reverse - the package is built to run on the latest R but is not compatible with slightly older versions (3.5.2 was Dec 2018) From: Jeff Zhang Sent: Thursday, December 26, 2019 5:36:50 PM To: Felix Cheung Cc: user.spark Subject: Re: Fail to use SparkR of 3.0

Re: Fail to use SparkR of 3.0 preview 2

2019-12-26 Thread Felix Cheung
It looks like a change in the method signature in R base packages. Which version of R are you running on? From: Jeff Zhang Sent: Thursday, December 26, 2019 12:46:12 AM To: user.spark Subject: Fail to use SparkR of 3.0 preview 2 I tried SparkR of spark 3.0

Re: SparkR integration with Hive 3 spark-r

2019-11-24 Thread Felix Cheung
I think you will get more answers if you ask without SparkR. Your question is independent of SparkR. Spark support for Hive 3.x (3.1.2) was added here: https://github.com/apache/spark/commit/1b404b9b9928144e9f527ac7b1caa15f932c2649 You should be able to connect Spark to the Hive metastore.

Re: JDK11 Support in Apache Spark

2019-08-24 Thread Felix Cheung
That’s great! From: ☼ R Nair Sent: Saturday, August 24, 2019 10:57:31 AM To: Dongjoon Hyun Cc: d...@spark.apache.org ; user @spark/'user @spark'/spark users/user@spark Subject: Re: JDK11 Support in Apache Spark Finally!!! Congrats On Sat, Aug 24, 2019, 11:11

Re: [PySpark] [SparkR] Is it possible to invoke a PySpark function with a SparkR DataFrame?

2019-07-16 Thread Felix Cheung
Not currently in Spark. However, there are systems out there that can share DataFrame between languages on top of Spark - it’s not calling the python UDF directly but you can pass the DataFrame to python and then .map(UDF) that way. From: Fiske, Danny Sent:

Re: Spark SQL in R?

2019-06-08 Thread Felix Cheung
I don’t think you should get a hive-site.xml from the internet. It should have connection information about a running Hive metastore - if you don’t have a Hive metastore service because you are running locally (from a laptop?) then you don’t really need it. You can get Spark to work with its own.

Re: sparksql in sparkR?

2019-06-07 Thread Felix Cheung
This seem to be more a question of spark-sql shell? I may suggest you change the email title to get more attention. From: ya Sent: Wednesday, June 5, 2019 11:48:17 PM To: user@spark.apache.org Subject: sparksql in sparkR? Dear list, I am trying to use sparksql

Re: Should python-2 be supported in Spark 3.0?

2019-05-31 Thread Felix Cheung
. From: shane knapp Sent: Friday, May 31, 2019 7:38:10 PM To: Denny Lee Cc: Holden Karau; Bryan Cutler; Erik Erlandson; Felix Cheung; Mark Hamstra; Matei Zaharia; Reynold Xin; Sean Owen; Wenchen Fen; Xiangrui Meng; dev; user Subject: Re: Should python-2 be supported in Spark 3.0? +1000

Re: Should python-2 be supported in Spark 3.0?

2019-05-30 Thread Felix Cheung
We don’t usually reference a future release on the website. > Spark website and state that Python 2 is deprecated in Spark 3.0 I suspect people will then ask when Spark 3.0 is coming out. Might need to provide some clarity on that. From: Reynold Xin Sent:

Re: Static partitioning in partitionBy()

2019-05-07 Thread Felix Cheung
You could df.filter(col("c") === "c1").write().partitionBy("c").save. It could hit some data skew problems but might work for you. From: Burak Yavuz Sent: Tuesday, May 7, 2019 9:35:10 AM To: Shubham Chaurasia Cc: dev; user@spark.apache.org Subject: Re: Static
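For illustration, a rough PySpark sketch of that suggestion - the DataFrame, column name "c", partition value "c1", and output path are all placeholders, not from the original thread:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("c1", 1), ("c2", 2)], ["c", "value"])

    # Write only the rows belonging to the static partition value "c1"
    (df.filter(col("c") == "c1")
       .write.partitionBy("c")
       .mode("append")
       .parquet("/tmp/static_partition_demo"))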

Re: ApacheCon NA 2019 Call For Proposal and help promoting Spark project

2019-04-14 Thread Felix Cheung
And a plug for the Graph Processing track - A discussion of comparison talk between the various Spark options (GraphX, GraphFrames, CAPS), or the ongoing work with SPARK-25994 Property Graphs, Cypher Queries, and Algorithms Would be great! From: Felix Cheung

ApacheCon NA 2019 Call For Proposal and help promoting Spark project

2019-04-13 Thread Felix Cheung
Hi Spark community! As you know ApacheCon NA 2019 is coming this Sept and its CFP is now open! This is an important milestone as we celebrate 20 years of ASF. We have tracks like Big Data and Machine Learning among many others. Please submit your talks/thoughts/challenges/learnings here:

Re: spark.submit.deployMode: cluster

2019-03-28 Thread Felix Cheung
If anyone wants to improve docs please create a PR. lol But seriously you might want to explore other projects that manage job submission on top of spark instead of rolling your own with spark-submit. From: Pat Ferrel Sent: Tuesday, March 26, 2019 2:38 PM

Re: Spark - Hadoop custom filesystem service loading

2019-03-23 Thread Felix Cheung
Hmm thanks. Do you have a proposed solution? From: Jhon Anderson Cardenas Diaz Sent: Monday, March 18, 2019 1:24 PM To: user Subject: Spark - Hadoop custom filesystem service loading Hi everyone, On spark 2.2.0, if you wanted to create a custom file system

Re: Spark-hive integration on HDInsight

2019-02-21 Thread Felix Cheung
You should check with HDInsight support From: Jay Singh Sent: Wednesday, February 20, 2019 11:43:23 PM To: User Subject: Spark-hive integration on HDInsight I am trying to integrate spark with hive on HDInsight spark cluster . I copied hive-site.xml in

Re: SparkR + binary type + how to get value

2019-02-19 Thread Felix Cheung
there: From: Thijs Haarhuis Sent: Tuesday, February 19, 2019 5:28 AM To: Felix Cheung; user@spark.apache.org Subject: Re: SparkR + binary type + how to get value Hi Felix, Thanks. I got it working now by using the unlist function. I have another question, maybe you can help me with, since I did

Re: SparkR + binary type + how to get value

2019-02-17 Thread Felix Cheung
: Thijs Haarhuis Sent: Thursday, February 14, 2019 4:01 AM To: Felix Cheung; user@spark.apache.org Subject: Re: SparkR + binary type + how to get value Hi Felix, Sure.. I have the following code: printSchema(results) cat("\n\n\n") firstRow <- first(results

Re: SparkR + binary type + how to get value

2019-02-13 Thread Felix Cheung
Please share your code From: Thijs Haarhuis Sent: Wednesday, February 13, 2019 6:09 AM To: user@spark.apache.org Subject: SparkR + binary type + how to get value Hi all, Does anybody have any experience in accessing the data from a column which has a binary

Re: java.lang.IllegalArgumentException: Unsupported class file major version 55

2019-02-10 Thread Felix Cheung
And it might not work completely. Spark only officially supports JDK 8. I’m not sure JDK 9+ support is complete. From: Jungtaek Lim Sent: Thursday, February 7, 2019 5:22 AM To: Gabor Somogyi Cc: Hande, Ranjit Dilip (Ranjit); user@spark.apache.org

Re: I have trained a ML model, now what?

2019-01-23 Thread Felix Cheung
Please comment in the JIRA/SPIP if you are interested! We can see the community support for a proposal like this. From: Pola Yao Sent: Wednesday, January 23, 2019 8:01 AM To: Riccardo Ferrari Cc: Felix Cheung; User Subject: Re: I have trained a ML model, now

Re: I have trained a ML model, now what?

2019-01-22 Thread Felix Cheung
About deployment/serving SPIP https://issues.apache.org/jira/browse/SPARK-26247 From: Riccardo Ferrari Sent: Tuesday, January 22, 2019 8:07 AM To: User Subject: I have trained a ML model, now what? Hi list! I am writing here to here about your experience on

Re: Persist Dataframe to HDFS considering HDFS Block Size.

2019-01-19 Thread Felix Cheung
You can call coalesce to combine partitions.. From: Shivam Sharma <28shivamsha...@gmail.com> Sent: Saturday, January 19, 2019 7:43 AM To: user@spark.apache.org Subject: Persist Dataframe to HDFS considering HDFS Block Size. Hi All, I wanted to persist dataframe
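As an illustration only, a minimal PySpark sketch of that suggestion - the data, target partition count, and output path are made-up placeholders; pick the partition count so each output file lands near your HDFS block size:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.range(0, 1000000)

    # coalesce merges the data into fewer partitions, producing fewer, larger files on HDFS
    df.coalesce(4).write.mode("overwrite").parquet("/tmp/coalesce_demo")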

Re: Spark on Yarn, is it possible to manually blacklist nodes before running spark job?

2019-01-19 Thread Felix Cheung
From: Li Gao Sent: Saturday, January 19, 2019 8:43 AM To: Felix Cheung Cc: Serega Sheypak; user Subject: Re: Spark on Yarn, is it possible to manually blacklist nodes before running spark job? on yarn it is impossible afaik. on kubernetes you can use taints

Re: Spark on Yarn, is it possible to manually blacklist nodes before running spark job?

2019-01-18 Thread Felix Cheung
Not as far as I recall... From: Serega Sheypak Sent: Friday, January 18, 2019 3:21 PM To: user Subject: Spark on Yarn, is it possible to manually blacklist nodes before running spark job? Hi, is there any possibility to tell Scheduler to blacklist specific

Re: spark2.4 arrow enabled true,error log not returned

2019-01-12 Thread Felix Cheung
Do you mean you run the same code on yarn and standalone? Can you check if they are running the same python versions? From: Bryan Cutler Sent: Thursday, January 10, 2019 5:29 PM To: libinsong1...@gmail.com Cc: zlist Spark Subject: Re: spark2.4 arrow enabled

Re: SparkR issue

2018-10-14 Thread Felix Cheung
1. It seems like it's spending a lot of time in R (slicing the data, I guess?) and not in Spark. 2. Could you write it into a csv file locally and then read it from Spark? From: ayan guha Sent: Monday, October 8, 2018 11:21 PM To: user Subject: SparkR issue Hi We

Re: can Spark 2.4 work on JDK 11?

2018-09-29 Thread Felix Cheung
Not officially. We have seen problems with JDK 10 as well. It would be great if you or someone else would like to contribute to get it working. From: kant kodali Sent: Tuesday, September 25, 2018 2:31 PM To: user @spark Subject: can Spark 2.4 work on JDK 11? Hi All,

Re: spark.lapply

2018-09-26 Thread Felix Cheung
It looks like the native R process is terminated from buffer overflow. Do you know how much data is involved? From: Junior Alvarez Sent: Wednesday, September 26, 2018 7:33 AM To: user@spark.apache.org Subject: spark.lapply Hi! I’m using spark.lapply() in

Re: Should python-2 be supported in Spark 3.0?

2018-09-16 Thread Felix Cheung
I don’t think we should remove any API even in a major release without deprecating it first... From: Mark Hamstra Sent: Sunday, September 16, 2018 12:26 PM To: Erik Erlandson Cc: user@spark.apache.org; dev Subject: Re: Should python-2 be supported in Spark 3.0?

Re: Spark 2.3.1 not working on Java 10

2018-06-21 Thread Felix Cheung
I'm not sure we have completed support for Java 10 From: Rahul Agrawal Sent: Thursday, June 21, 2018 7:22:42 AM To: user@spark.apache.org Subject: Spark 2.3.1 not working on Java 10 Dear Team, I have installed Java 10, Scala 2.12.6 and spark 2.3.1 in my

Re: all calculations finished, but "VCores Used" value remains at its max

2018-05-01 Thread Felix Cheung
Zeppelin keeps the Spark job alive. This is likely a better question for the Zeppelin project. From: Valery Khamenya Sent: Tuesday, May 1, 2018 4:30:24 AM To: user@spark.apache.org Subject: all calculations finished, but "VCores Used" value

Re: Problem running Kubernetes example v2.2.0-kubernetes-0.5.0

2018-04-22 Thread Felix Cheung
You might want to check with the spark-on-k8s project, or try using Kubernetes from the official Spark 2.3.0 release. (Yes, we don't have an official Docker image, but you can build one with the script.) From: Rico Bergmann Sent: Wednesday, April

Re: [Structured Streaming Query] Calculate Running Avg from Kafka feed using SQL query

2018-04-06 Thread Felix Cheung
Instead of writing to the console, you need to write to the memory sink for it to be queryable: .format("memory").queryName("tableName") https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#output-sinks From: Aakash Basu
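A minimal PySpark sketch of the memory-sink approach, using the built-in rate source as a stand-in for the Kafka feed; the aggregation and query name are illustrative only:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Toy streaming source standing in for the Kafka feed
    stream = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

    # Write the running aggregation to the in-memory sink so it can be queried with SQL
    query = (stream.groupBy().avg("value")
                   .writeStream
                   .outputMode("complete")
                   .format("memory")
                   .queryName("tableName")
                   .start())

    spark.sql("SELECT * FROM tableName").show()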

Re: [Spark R]: Linear Mixed-Effects Models in Spark R

2018-03-26 Thread Felix Cheung
If your data can be split into groups and you can call into your favorite R package on each group of data (in parallel): https://spark.apache.org/docs/latest/sparkr.html#run-a-given-function-on-a-large-dataset-grouping-by-input-columns-and-using-gapply-or-gapplycollect

Re: Custom metrics sink

2018-03-16 Thread Felix Cheung
There is a proposal to expose them. See SPARK-14151 From: Christopher Piggott Sent: Friday, March 16, 2018 1:09:38 PM To: user@spark.apache.org Subject: Custom metrics sink Just for fun, i want to make a stupid program that makes different

Re: How to start practicing Python Spark Streaming in Linux?

2018-03-14 Thread Felix Cheung
It’s best to start with Structured Streaming https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#tab_python_0 https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html#tab_python_0 _ From: Aakash Basu

Re: Question on Spark-kubernetes integration

2018-03-02 Thread Felix Cheung
For pyspark specifically IMO should be very high on the list to port back... As for roadmap - should be sharing more soon. From: lucas.g...@gmail.com <lucas.g...@gmail.com> Sent: Friday, March 2, 2018 9:41:46 PM To: user@spark.apache.org Cc: Felix Cheung S

Re: Question on Spark-kubernetes integration

2018-03-02 Thread Felix Cheung
That's in the plan. We should be sharing a bit more about the roadmap in future releases shortly. In the meantime, this is in the official documentation on what is coming: https://spark.apache.org/docs/latest/running-on-kubernetes.html#future-work This support started as a fork of the Apache

Re: Spark on K8s - using files fetched by init-container?

2018-02-27 Thread Felix Cheung
Yes you were pointing to HDFS on a loopback address... From: Jenna Hoole Sent: Monday, February 26, 2018 1:11:35 PM To: Yinan Li; user@spark.apache.org Subject: Re: Spark on K8s - using files fetched by init-container? Oh, duh. I

Re: [graphframes]how Graphframes Deal With BidirectionalRelationships

2018-02-20 Thread Felix Cheung
No it does not support bi directional edges as of now. _ From: xiaobo <guxiaobo1...@qq.com> Sent: Tuesday, February 20, 2018 4:35 AM Subject: Re: [graphframes]how Graphframes Deal With BidirectionalRelationships To: Felix Cheung <felixcheun...@hotmail.co

Re: [graphframes]how Graphframes Deal With Bidirectional Relationships

2018-02-19 Thread Felix Cheung
Generally that would be the approach. But since you effectively double the number of edges, this will likely affect the scale at which your job will run. From: xiaobo Sent: Monday, February 19, 2018 3:22:02 AM To: user@spark.apache.org Subject:
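For illustration, a small PySpark sketch of the doubled-edge approach being discussed; the toy edge list and the src/dst column names (the GraphFrames convention) are assumptions, and any extra edge property columns would need to be carried along as well:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.getOrCreate()
    edges = spark.createDataFrame([("a", "b"), ("b", "c")], ["src", "dst"])

    # Add the reverse of every edge so relationships are effectively bidirectional
    reversed_edges = edges.select(col("dst").alias("src"), col("src").alias("dst"))
    bidirectional_edges = edges.union(reversed_edges)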

Re: Does Pyspark Support Graphx?

2018-02-18 Thread Felix Cheung
Hi - I’m maintaining it. As of now there is an issue with 2.2 that breaks personalized page rank, and that’s largely the reason there isn’t a release for 2.2 support. There are attempts to address this issue - if you are interested, we would love your help.

Re: SparkR test script issue: unable to run run-tests.h on spark 2.2

2018-02-14 Thread Felix Cheung
Yes, it is an issue with the newer release of testthat. To work around it, could you install an earlier version with devtools? Will follow up with a fix. _ From: Hyukjin Kwon Sent: Wednesday, February 14, 2018 6:49 PM Subject: Re: SparkR test script

Re: py4j.protocol.Py4JJavaError: An error occurred while calling o794.parquet

2018-01-10 Thread Felix Cheung
java.nio.BufferUnderflowException Can you try reading the same data in Scala? From: Liana Napalkova Sent: Wednesday, January 10, 2018 12:04:00 PM To: Timur Shenkao Cc: user@spark.apache.org Subject: Re: py4j.protocol.Py4JJavaError:

Re: Is Apache Spark-2.2.1 compatible with Hadoop-3.0.0

2018-01-08 Thread Felix Cheung
And Hadoop-3.x is not part of the release and sign off for 2.2.1. Maybe we could update the website to avoid any confusion with "later". From: Josh Rosen Sent: Monday, January 8, 2018 10:17:14 AM To: akshay naidu Cc: Saisai Shao; Raj

Re: Passing an array of more than 22 elements in a UDF

2017-12-26 Thread Felix Cheung
7 9:13 PM Subject: Re: Passing an array of more than 22 elements in a UDF To: Felix Cheung <felixcheun...@hotmail.com> Cc: ayan guha <guha.a...@gmail.com>, user <user@spark.apache.org> What's the privilege of using that specific version for this? Please throw some light onto i

Re: Spark 2.2.1 worker invocation

2017-12-26 Thread Felix Cheung
I think you are looking for spark.executor.extraJavaOptions https://spark.apache.org/docs/latest/configuration.html#runtime-environment From: Christopher Piggott Sent: Tuesday, December 26, 2017 8:00:56 AM To: user@spark.apache.org Subject:
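A short, hedged example of setting that property from PySpark - the JVM flags shown are arbitrary examples, not a recommendation:

    from pyspark.sql import SparkSession

    # Extra JVM options passed to every executor
    spark = (SparkSession.builder
             .config("spark.executor.extraJavaOptions", "-XX:+UseG1GC -verbose:gc")
             .getOrCreate())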

Re: Passing an array of more than 22 elements in a UDF

2017-12-24 Thread Felix Cheung
Or use it with Scala 2.11? From: ayan guha Sent: Friday, December 22, 2017 3:15:14 AM To: Aakash Basu Cc: user Subject: Re: Passing an array of more than 22 elements in a UDF Hi I think you are in correct track. You can stuff all your param

Re: [Spark R]: dapply only works for very small datasets

2017-11-28 Thread Felix Cheung
; Sent: Tuesday, November 28, 2017 3:11 AM Subject: AW: [Spark R]: dapply only works for very small datasets To: Felix Cheung <felixcheun...@hotmail.com>, <user@spark.apache.org> Thanks for the fast reply. I tried it locally, with 1 - 8 slots on a 8 core machine w/ 25GB memory as w

Re: [Spark R]: dapply only works for very small datasets

2017-11-27 Thread Felix Cheung
What's the number of executors and/or number of partitions you are working with? I'm afraid most of the problem is the serialization/deserialization overhead between the JVM and R... From: Kunft, Andreas Sent: Monday, November 27,

Re: using R with Spark

2017-09-24 Thread Felix Cheung
et.net/> www.linkedin.com/in/bobwakefieldmba Twitter: @BobLovesData From: Georg Heiler [mailto:georg.kf.hei...@gmail.com] Sent: Sunday, September 24, 2017 3:39 PM To: Felix Cheung <felixcheun...@hot

Re: using R with Spark

2017-09-24 Thread Felix Cheung
If you google it you will find posts or info on how to connect it to different cloud and hadoop/spark vendors. From: Georg Heiler <georg.kf.hei...@gmail.com> Sent: Sunday, September 24, 2017 1:39:09 PM To: Felix Cheung; Adaryl Wakefield; user@spark.apac

Re: using R with Spark

2017-09-24 Thread Felix Cheung
Both are free to use; you can use sparklyr from the R shell without RStudio (but you probably want an IDE) From: Adaryl Wakefield Sent: Sunday, September 24, 2017 11:19:24 AM To: user@spark.apache.org Subject: using R with Spark

Re: graphframes on cluster

2017-09-20 Thread Felix Cheung
Could you include the code where it fails? Generally the best way to use GraphFrames is to use the --packages option with the spark-submit command. From: Imran Rajjad Sent: Wednesday, September 20, 2017 5:47:27 AM To: user @spark Subject: graphframes on

Re: Queries with streaming sources must be executed with writeStream.start()

2017-09-09 Thread Felix Cheung
What is newDS? If it is a Streaming Dataset/DataFrame (since you have writeStream there) then there seems to be an issue preventing toJSON to work. From: kant kodali Sent: Saturday, September 9, 2017 4:04:33 PM To: user @spark Subject:

Re: How to convert Row to JSON in Java?

2017-09-09 Thread Felix Cheung
toJSON on Dataset/DataFrame? From: kant kodali Sent: Saturday, September 9, 2017 4:15:49 PM To: user @spark Subject: How to convert Row to JSON in Java? Hi All, How to convert Row to JSON in Java? It would be nice to have .toJson() method

Re: sparkR 3rd library

2017-09-04 Thread Felix Cheung
Can you include the code where you call spark.lapply? From: patcharee Sent: Sunday, September 3, 2017 11:46:40 PM To: spar >> user@spark.apache.org Subject: sparkR 3rd library Hi, I am using spark.lapply to execute an existing R script in

Re: With 2.2.0 PySpark is now available for pip install from PyPI :)

2017-07-12 Thread Felix Cheung
Awesome! Congrats!! From: holden.ka...@gmail.com on behalf of Holden Karau Sent: Wednesday, July 12, 2017 12:26:00 PM To: user@spark.apache.org Subject: With 2.2.0 PySpark is now available for pip install from PyPI

Re: How save streaming aggregations on 'Structured Streams' in parquet format ?

2017-06-19 Thread Felix Cheung
And perhaps the error message can be improved here? From: Tathagata Das Sent: Monday, June 19, 2017 8:24:01 PM To: kaniska Mandal Cc: Burak Yavuz; user Subject: Re: How save streaming aggregations on 'Structured Streams' in parquet

Re: problem initiating spark context with pyspark

2017-06-10 Thread Felix Cheung
Curtis, assuming you are running a somewhat recent windows version you would not have access to c:\tmp, in your command example winutils.exe ls -F C:\tmp\hive Try changing the path to under your user directory. Running Spark on Windows should work :) From:

Re: "java.lang.IllegalStateException: There is no space for new record" in GraphFrames

2017-04-28 Thread Felix Cheung
Can you allocate more memory to the executor? Also please open an issue with GraphFrames on its GitHub. From: rok Sent: Friday, April 28, 2017 1:42:33 AM To: user@spark.apache.org Subject: "java.lang.IllegalStateException: There is no space for new

Re: how to create List in pyspark

2017-04-28 Thread Felix Cheung
Why not use the SQL functions explode and split? They would perform better and be more stable than a UDF. From: Yanbo Liang Sent: Thursday, April 27, 2017 7:34:54 AM To: Selvam Raman Cc: user Subject: Re: how to create List in pyspark You can try with UDF, like
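A minimal PySpark sketch of that suggestion, assuming a hypothetical column "items" holding comma-separated values:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import split, explode, col

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("k1", "a,b,c")], ["key", "items"])

    # split turns the string into an array; explode yields one row per array element
    result = df.withColumn("item", explode(split(col("items"), ",")))
    result.show()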

Re: Spark SQL - Global Temporary View is not behaving as expected

2017-04-22 Thread Felix Cheung
Cross-session in this context means multiple Spark sessions from the same Spark context. Since you are running two shells, you have different Spark contexts. Do you have to use a temp view? Could you create a table? _ From: Hemanth Gudela
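A hedged sketch of the "create a table" suggestion in PySpark - the table name and data are placeholders, and it assumes both shells point at the same metastore/warehouse directory:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()
    df = spark.createDataFrame([(1, "a")], ["id", "name"])

    # Shell 1: persist the data as a metastore table instead of a temp view
    df.write.mode("overwrite").saveAsTable("shared_table")

    # Shell 2 (a separate pyspark/spark-shell process): read it back
    spark.table("shared_table").show()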

Re: [sparkR] [MLlib] : Is word2vec implemented in SparkR MLlib ?

2017-04-21 Thread Felix Cheung
Not currently - how are you planning to use the output from word2vec? From: Radhwane Chebaane Sent: Thursday, April 20, 2017 4:30:14 AM To: user@spark.apache.org Subject: [sparkR] [MLlib] : Is word2vec implemented in SparkR MLlib ? Hi,

Re: Graph Analytics on HBase with HGraphDB and Spark GraphFrames

2017-04-02 Thread Felix Cheung
Interesting! From: Robert Yokota Sent: Sunday, April 2, 2017 9:40:07 AM To: user@spark.apache.org Subject: Graph Analytics on HBase with HGraphDB and Spark GraphFrames Hi, In case anyone is interested in analyzing graphs in HBase with Apache

Re: Getting exit code of pipe()

2017-02-12 Thread Felix Cheung
Subject: Re: Getting exit code of pipe() To: Felix Cheung <felixcheun...@hotmail.com> Cc: <user@spark.apache.org> Cool that's exactly what I was looking for! Thanks! How does one output the status into

Re: Getting exit code of pipe()

2017-02-11 Thread Felix Cheung
Do you want the job to fail if there is an error exit code? You could set checkCode to True spark.apache.org/docs/latest/api/python/pyspark.html?highlight=pipe#pyspark.RDD.pipe Otherwise maybe you want
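For illustration, a tiny PySpark example of the checkCode flag mentioned in the linked docs; the piped command ("cat") is arbitrary:

    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()
    rdd = sc.parallelize(["a", "b", "c"])

    # With checkCode=True the task fails if the piped command exits with a non-zero code
    piped = rdd.pipe("cat", checkCode=True)
    print(piped.collect())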

Re: Examples in graphx

2017-01-29 Thread Felix Cheung
Which graph database are you thinking about? Here's one for Neo4j: https://neo4j.com/blog/neo4j-3-0-apache-spark-connector/ From: Deepak Sharma Sent: Sunday, January 29, 2017 4:28:19 AM To: spark users Subject: Examples in graphx Hi There, Are

Re: Creating UUID using SparksSQL

2017-01-18 Thread Felix Cheung
spark.apache.org/docs/2.0.0/api/python/pyspark.sql.html#pyspark.sql.functions.monotonically_increasing_id ? From: Ninad Shringarpure
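A short PySpark sketch of that suggestion; the DataFrame and column name are placeholders, and note the generated IDs are unique and increasing but not consecutive:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import monotonically_increasing_id

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("a",), ("b",)], ["name"])

    # Adds a unique 64-bit id per row (not a true UUID, but unique within the DataFrame)
    df.withColumn("row_id", monotonically_increasing_id()).show()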

Re: what does dapply actually do?

2017-01-18 Thread Felix Cheung
With Spark, the processing is performed lazily. This means nothing much really happens until you call an "action" - an example is collect(). Another way is to write the output in a distributed manner - see write.df() in R. With SparkR dapply() passing the data from Spark to R to

Re: Spark GraphFrame ConnectedComponents

2017-01-05 Thread Felix Cheung
. From: Ankur Srivastava <ankur.srivast...@gmail.com> Sent: Thursday, January 5, 2017 3:45:59 PM To: Felix Cheung; d...@spark.apache.org Cc: user@spark.apache.org Subject: Re: Spark GraphFrame ConnectedComponents Adding DEV mailing list to see if this is a defect with ConnectedCom

Re: Spark GraphFrame ConnectedComponents

2017-01-05 Thread Felix Cheung
uary 5, 2017 10:05:03 AM To: Felix Cheung Cc: user@spark.apache.org Subject: Re: Spark GraphFrame ConnectedComponents Yes it works to read the vertices and edges data from S3 location and is also able to write the checkpoint files to S3. It only fails when deleting the data and that is because it

Re: Spark GraphFrame ConnectedComponents

2017-01-05 Thread Felix Cheung
nkur.srivast...@gmail.com>> Sent: Wednesday, January 4, 2017 9:23 PM Subject: Re: Spark GraphFrame ConnectedComponents To: Felix Cheung <felixcheun...@hotmail.com> Cc: <user@spark.apache.org> This is the exact trace

Re: Spark GraphFrame ConnectedComponents

2017-01-04 Thread Felix Cheung
Do you have more of the exception stack? From: Ankur Srivastava Sent: Wednesday, January 4, 2017 4:40:02 PM To: user@spark.apache.org Subject: Spark GraphFrame ConnectedComponents Hi, I am trying to use the ConnectedComponent

Re: Issue with SparkR setup on RStudio

2017-01-02 Thread Felix Cheung
is not set in the Windows tests. _ From: Md. Rezaul Karim <rezaul.ka...@insight-centre.org> Sent: Monday, January 2, 2017 7:58 AM Subject: Re: Issue with SparkR setup on RStudio To: Felix Cheung <felixcheun...@hotm

Re: How to load a big csv to dataframe in Spark 1.6

2016-12-31 Thread Felix Cheung
ect: Re: How to load a big csv to dataframe in Spark 1.6 To: Felix Cheung <felixcheun...@hotmail.com> Cc: <user@spark.apache.org> Hello Felix, I followed the instruction and ran the command: >

Re: Spark Graphx with Database

2016-12-30 Thread Felix Cheung
You might want to check out GraphFrames - to load database data (as Spark DataFrame) and build graphs with them https://github.com/graphframes/graphframes _ From: balaji9058 > Sent: Monday, December 26, 2016 9:27 PM
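A hedged PySpark sketch of building a GraphFrame from two DataFrames (which could themselves be loaded from a database, e.g. via JDBC); it assumes the graphframes package is on the classpath, and the toy schema is illustrative:

    from pyspark.sql import SparkSession
    from graphframes import GraphFrame

    spark = SparkSession.builder.getOrCreate()

    # Vertices need an "id" column; edges need "src" and "dst" columns
    vertices = spark.createDataFrame([(1, "Alice"), (2, "Bob")], ["id", "name"])
    edges = spark.createDataFrame([(1, 2, "follows")], ["src", "dst", "relationship"])

    g = GraphFrame(vertices, edges)
    g.inDegrees.show()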

Re: Difference in R and Spark Output

2016-12-30 Thread Felix Cheung
Could you elaborate more on the huge difference you are seeing? From: Saroj C Sent: Friday, December 30, 2016 5:12:04 AM To: User Subject: Difference in R and Spark Output Dear All, For the attached input file, there is a huge difference

Re: How to load a big csv to dataframe in Spark 1.6

2016-12-30 Thread Felix Cheung
Have you tried the spark-csv package? https://spark-packages.org/package/databricks/spark-csv From: Raymond Xie Sent: Friday, December 30, 2016 6:46:11 PM To: user@spark.apache.org Subject: How to load a big csv to dataframe in Spark 1.6
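As an illustration, a Spark 1.6-style sketch of reading a CSV through that package - the file path and options are placeholders, and it assumes the job is launched with something like --packages com.databricks:spark-csv_2.10:1.5.0:

    # In a Spark 1.6 pyspark shell, sqlContext is provided; in a script, build a SQLContext from the SparkContext
    df = (sqlContext.read
          .format("com.databricks.spark.csv")
          .option("header", "true")
          .option("inferSchema", "true")
          .load("hdfs:///data/big_file.csv"))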

Re: Issue with SparkR setup on RStudio

2016-12-29 Thread Felix Cheung
Any reason you are setting HADOOP_HOME? From the error it seems you are running into an issue with the Hive config, likely with trying to load hive-site.xml. Could you try not setting HADOOP_HOME? From: Md. Rezaul Karim Sent:

Re: GraphFrame not init vertices when load edges

2016-12-18 Thread Felix Cheung
There is not a GraphLoader for GraphFrames but you could load and convert from GraphX: http://graphframes.github.io/user-guide.html#graphx-to-graphframe From: zjp_j...@163.com <zjp_j...@163.com> Sent: Sunday, December 18, 2016 9:39:49 PM To: Felix Cheung

Re: GraphFrame not init vertices when load edges

2016-12-18 Thread Felix Cheung
Or this is a better link: http://graphframes.github.io/quick-start.html _ From: Felix Cheung <felixcheun...@hotmail.com> Sent: Sunday, December 18, 2016 8:46 PM Subject: Re: GraphFrame not init vertices when load edge

Re: GraphFrame not init vertices when load edges

2016-12-18 Thread Felix Cheung
Can you clarify? Vertices should be another DataFrame as you can see in the example here: https://github.com/graphframes/graphframes/blob/master/docs/quick-start.md From: zjp_j...@163.com Sent: Sunday, December 18, 2016 6:25:50 PM To: user

Re: Spark Dataframe: Save to hdfs is taking long time

2016-12-15 Thread Felix Cheung
What is the format? From: KhajaAsmath Mohammed Sent: Thursday, December 15, 2016 7:54:27 PM To: user @spark Subject: Spark Dataframe: Save to hdfs is taking long time Hi, I am having an issue while saving the dataframe back to HDFS. It's

Re: How to load edge with properties file useing GraphX

2016-12-15 Thread Felix Cheung
Have you checked out https://github.com/graphframes/graphframes? It might be easier to work with DataFrame. From: zjp_j...@163.com Sent: Thursday, December 15, 2016 7:23:57 PM To: user Subject: How to load edge with properties file useing

Re: [GraphFrame, Pyspark] Weighted Edge in PageRank

2016-12-01 Thread Felix Cheung
That's correct - currently GraphFrame does not compute PageRank with weighted edges. _ From: Weiwei Zhang > Sent: Thursday, December 1, 2016 2:41 PM Subject: [GraphFrame, Pyspark] Weighted Edge in PageRank To:

Re: PySpark to remote cluster

2016-11-30 Thread Felix Cheung
Spark 2.0.1 is running with a different py4j library than Spark 1.6. You will probably run into other problems mixing versions though - is there a reason you can't run Spark 1.6 on the client? _ From: Klaus Schaefers

Re: How to propagate R_LIBS to sparkr executors

2016-11-17 Thread Felix Cheung
Have you tried spark.executorEnv.R_LIBS? spark.apache.org/docs/latest/configuration.html#runtime-environment _ From: Rodrick Brown > Sent: Wednesday, November 16, 2016 1:01 PM Subject: How to propagate R_LIBS to

Re: Strongly Connected Components

2016-11-10 Thread Felix Cheung
It is possible it is dead. Could you check the Spark UI to see if there is any progress? _ From: Shreya Agarwal > Sent: Thursday, November 10, 2016 12:45 AM Subject: RE: Strongly Connected Components To:

Re: Issue Running sparkR on YARN

2016-11-09 Thread Felix Cheung
It may be that the Spark executor is running as a different user and can't see where Rscript is. You might want to try adding the Rscript path to PATH. Also please see this for the config property to set for the R command to use: https://spark.apache.org/docs/latest/configuration.html#sparkr

Re: Substitute Certain Rows a data Frame using SparkR

2016-10-19 Thread Felix Cheung
It's a bit less concise but this works:

> a <- as.DataFrame(cars)
> head(a)
  speed dist
1     4    2
2     4   10
3     7    4
4     7   22
5     8   16
6     9   10
> b <- withColumn(a, "speed", ifelse(a$speed > 15, a$speed, 3))
> head(b)
  speed dist
1     3    2
2     3   10
3     3    4
4     3   22
5     3   16
6     3   10

I think your example could be something

Re: Is Spark 2.0 master node compatible with Spark 1.5 work node?

2016-09-18 Thread Felix Cheung
ink a 2.0 uber jar will play nicely on a 1.5 standalone cluster. On Saturday, September 10, 2016, Felix Cheung <felixcheun...@hotmail.com> wrote: You should be able to get it to work with 2.0 as an uber jar. What type of cluster are you running on? YARN? An

Re: SparkR error: reference is ambiguous.

2016-09-10 Thread Felix Cheung
Could you provide more information on how df in your example is created? Also please include the output from printSchema(df). This example works:

> c <- createDataFrame(cars)
> c
SparkDataFrame[speed:double, dist:double]
> c$speed <- c$dist*0
> c
SparkDataFrame[speed:double, dist:double]
>

Re: questions about using dapply

2016-09-10 Thread Felix Cheung
You might need MARGIN capitalized; this example works though:

c <- as.DataFrame(cars)
# rename the columns to c1, c2
c <- selectExpr(c, "speed as c1", "dist as c2")
cols_in <- dapplyCollect(c,
  function(x) {
    apply(x[, paste("c", 1:2, sep = "")], MARGIN = 2,
          FUN = function(y) { y %in% c(61, 99) })
  })
#

Re: SparkR API problem with subsetting distributed data frame

2016-09-10 Thread Felix Cheung
How are you calling dirs()? What would be x? Is dat a SparkDataFrame? With SparkR, i in dat[i, 4] should be a logical expression for rows, e.g. df[df$age %in% c(19, 30), 1:2] On Sat, Sep 10, 2016 at 11:02 AM -0700, "Bene" >

Re: Assign values to existing column in SparkR

2016-09-10 Thread Felix Cheung
If you are to set a column to 0 (essentially remove and replace the existing one) you would need to put a column on the right-hand side:

> df <- as.DataFrame(iris)
> head(df)
  Sepal_Length Sepal_Width Petal_Length Petal_Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2

Re: SparkR API problem with subsetting distributed data frame

2016-09-10 Thread Felix Cheung
Could you include code snippets you are running? On Sat, Sep 10, 2016 at 1:44 AM -0700, "Bene" > wrote: Hi, I am having a problem with the SparkR API. I need to subset a distributed data so I can extract single values from
