Complex transformation on a dataframe column

2015-10-15 Thread Hao Wang
(nullable = true) The function for formatting addresses: def formatAddress(address: String): String Best regards, Hao Wang
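One common way to apply such a per-column transformation is a UDF. The following is only a minimal sketch, assuming a spark-shell session (sqlContext in scope), a placeholder formatAddress body, and an assumed column name "address"; none of these details come from the original message.

    import org.apache.spark.sql.functions.udf

    // Placeholder body; the real formatting logic is not shown in the thread.
    def formatAddress(address: String): String = address.trim.toUpperCase

    // Toy DataFrame standing in for the poster's data (column name is an assumption).
    val df = sqlContext.createDataFrame(Seq(
      Tuple1(" 800 dongchuan road, minhang "),
      Tuple1(" 1 main street ")
    )).toDF("address")

    val formatUdf = udf(formatAddress _)
    df.withColumn("address", formatUdf(df("address"))).show()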

Re: How to convert dataframe to a nested StructType schema

2015-09-17 Thread Hao Wang
t.createDataFrame(rowrdd, yourNewSchema) > > Thanks! > > - Terry > > On Wed, Sep 16, 2015 at 2:10 AM, Hao Wang <billhao.l...@gmail.com> wrote: > >> Hi, >> >> I created a dataframe with 4 string columns (city, state, country, >> zipcode). >> I then a
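A minimal sketch of that createDataFrame route, assuming a spark-shell session (sqlContext in scope) and assuming the desired target layout is a single address struct wrapping the four flat columns; the sample row is invented for illustration.

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types._

    // Flat stand-in for the original DataFrame(city, state, country, zipcode).
    val flat = sqlContext.createDataFrame(Seq(
      ("Shanghai", "SH", "CN", "200240")
    )).toDF("city", "state", "country", "zipcode")

    // Target nested schema: one struct column holding the four fields.
    val nestedSchema = StructType(Seq(
      StructField("address", StructType(Seq(
        StructField("city", StringType, nullable = true),
        StructField("state", StringType, nullable = true),
        StructField("country", StringType, nullable = true),
        StructField("zipcode", StringType, nullable = true)
      )), nullable = true)
    ))

    // Key step: rebuild each flat Row as a nested Row before applying the new schema.
    val nestedRows = flat.rdd.map(r =>
      Row(Row(r.getString(0), r.getString(1), r.getString(2), r.getString(3))))
    val nested = sqlContext.createDataFrame(nestedRows, nestedSchema)
    nested.printSchema()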

How to convert dataframe to a nested StructType schema

2015-09-15 Thread Hao Wang
Hi, I created a dataframe with 4 string columns (city, state, country, zipcode). I then applied the following nested schema to it by creating a custom StructType. When I run df.take(5), it gives the exception below as expected. The question is how I can convert the Rows in the dataframe to

Re: How to split log data into different files according to severity

2015-06-14 Thread Hao Wang
.html https://www.mail-archive.com/user@spark.apache.org/msg30204.html On June 13, 2015, at 5:41 AM, Hao Wang bill...@gmail.com wrote: Hi, I have a bunch of large log files on Hadoop. Each line contains a log and its severity. Is there a way that I can use Spark to split the entire data

Re: How to split log data into different files according to severity

2015-06-13 Thread Hao Wang
I am currently using filter inside a loop over all severity levels to do this, which I think is pretty inefficient. It has to read the entire data set once for each severity. I wonder if there is a more efficient way that takes just one pass over the data? Thanks. Best, Hao Wang On Jun 13, 2015
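One commonly suggested single-pass approach for this pattern (not necessarily the one in the linked reply) is to key each line by its severity and write with a custom MultipleTextOutputFormat so each severity lands in its own output location. A sketch, assuming a bracketed severity prefix like "[ERROR]" and placeholder HDFS paths:

    import org.apache.hadoop.io.NullWritable
    import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat
    import org.apache.spark.{SparkConf, SparkContext}

    // Routes each (severity, line) pair into its own output directory, e.g. error_file/, info_file/.
    class SeverityOutputFormat extends MultipleTextOutputFormat[Any, Any] {
      override def generateActualKey(key: Any, value: Any): Any =
        NullWritable.get()                          // write only the log line, not the key
      override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
        key.toString.toLowerCase + "_file/" + name  // keep the part-file name to avoid clashes
    }

    object SplitBySeverity {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("SplitBySeverity"))
        sc.textFile("hdfs:///logs/input")           // placeholder input path
          .map { line =>
            val severity = line.takeWhile(_ != ']').stripPrefix("[")  // "[ERROR] ..." -> "ERROR"
            (severity, line)
          }
          .saveAsHadoopFile("hdfs:///logs/by-severity",               // placeholder output path
            classOf[String], classOf[String], classOf[SeverityOutputFormat])
        sc.stop()
      }
    }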

How to split log data into different files according to severity

2015-06-13 Thread Hao Wang
Input: [ERROR] log1 [INFO] log2 [ERROR] log3 [INFO] log4 Output: error_file [ERROR] log1 [ERROR] log3 info_file [INFO] log2 [INFO] log4 Best, Hao Wang

Re: Kryo deserialisation error

2014-07-17 Thread Hao Wang
Hi, all Yes, it's the name of a Wikipedia article. I am running the WikipediaPageRank example from Spark's Bagel package. I am wondering whether there is any relation to the Kryo buffer size. The PageRank sometimes finishes successfully and sometimes does not, because this kind of Kryo exception happens too many times, which
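If the failure really is a Kryo buffer overflow, enlarging the buffer in SparkConf is the usual first step. A sketch assuming a 1.x-era configuration; the property name changed across releases (newer versions use spark.kryoserializer.buffer.max), so check the docs for the version in use:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("WikipediaPageRank")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryoserializer.buffer.mb", "64")   // default was only a few MB in that era
    val sc = new SparkContext(conf)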

Re: Kryo deserialisation error

2014-07-16 Thread Hao Wang
, Shanghai, 200240 Email:wh.s...@gmail.com On Wed, Jul 16, 2014 at 1:41 PM, Tathagata Das tathagata.das1...@gmail.com wrote: Are you using classes from external libraries that have not been added to the sparkContext, using sparkcontext.addJar()? TD On Tue, Jul 15, 2014 at 8:36 PM, Hao Wang
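For reference, a minimal sketch of the addJar suggestion above, assuming a 1.x SparkContext and a placeholder jar path; spark-submit's --jars flag achieves the same thing at submission time.

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("WikipediaPageRank"))
    // Ship the external library to every executor so its classes can be deserialised there.
    sc.addJar("/path/to/external-library.jar")   // placeholder path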

Re: Kryo deserialisation error

2014-07-16 Thread Hao Wang
Dongchuan Road, Minhang District, Shanghai, 200240 Email:wh.s...@gmail.com On Thu, Jul 17, 2014 at 2:58 AM, Tathagata Das tathagata.das1...@gmail.com wrote: Is the class that is not found in the wikipediapagerank jar? TD On Wed, Jul 16, 2014 at 12:32 AM, Hao Wang wh.s...@gmail.com wrote: Thanks

Re: Kryo deserialisation error

2014-07-15 Thread Hao Wang
I am running the WikipediaPageRank example in Spark and hit the same problem as you: 14/07/16 11:31:06 DEBUG DAGScheduler: submitStage(Stage 6) 14/07/16 11:31:06 ERROR TaskSetManager: Task 6.0:450 failed 4 times; aborting job 14/07/16 11:31:06 INFO DAGScheduler: Failed to run foreach at

Re: long GC pause during file.cache()

2014-06-15 Thread Hao Wang
Hi, Wei You may try setting JVM opts in spark-env.sh as follows to prevent or mitigate GC pauses: export SPARK_JAVA_OPTS="-XX:-UseGCOverheadLimit -XX:+UseConcMarkSweepGC -Xmx2g -XX:MaxPermSize=256m" There are more options you could add, please just Google :) Regards, Wang Hao(王灏) CloudTeam |
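A rough per-application equivalent of those flags via SparkConf, assuming a Spark 1.x cluster; note that the executor heap itself is sized with spark.executor.memory rather than -Xmx, and exact property support varies by version:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("cache-heavy-job")
      .set("spark.executor.memory", "2g")               // heap size per executor
      .set("spark.executor.extraJavaOptions",
        "-XX:+UseConcMarkSweepGC -XX:MaxPermSize=256m")  // GC flags only, no -Xmx here
    val sc = new SparkContext(conf)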

Akka listens to hostname while user may spark-submit with master in IP url

2014-06-15 Thread Hao Wang
Hi, All In Spark, spark.driver.host defaults to the driver's hostname, so the Akka actor system listens on a URL like akka.tcp://hostname:port. However, when a user runs an application with spark-submit, the user may set --master spark://192.168.1.12:7077. Then the AppClient in
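If the mismatch is on the driver side, the advertised address can be pinned explicitly. A sketch, assuming Spark 1.x standalone mode and a placeholder IP:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("example")
      .set("spark.driver.host", "192.168.1.12")   // placeholder; use an address the workers can reach
    val sc = new SparkContext(conf)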

BUG? Why does MASTER have to be set to spark://hostname:port?

2014-06-13 Thread Hao Wang
Hi, all When I try to run Spark PageRank using: ./bin/spark-submit \ --master spark://192.168.1.12:7077 \ --class org.apache.spark.examples.bagel.WikipediaPageRank \ ~/Documents/Scala/WikiPageRank/target/scala-2.10/wikipagerank_2.10-1.0.jar \ hdfs://192.168.1.12:9000/freebase-13G 0.05 100

Re: how to set spark.executor.memory and heap size

2014-06-13 Thread Hao Wang
Hi, Laurent You could set spark.executor.memory and the heap size by the following methods: 1. in your conf/spark-env.sh: export SPARK_WORKER_MEMORY=38g export SPARK_JAVA_OPTS="-XX:-UseGCOverheadLimit -XX:+UseConcMarkSweepGC -Xmx2g -XX:MaxPermSize=256m" 2. you could also add modifications for
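Note that SPARK_WORKER_MEMORY caps what a worker can hand out, while the per-application executor heap is set via spark.executor.memory. A sketch of the programmatic route, assuming Spark 1.x and placeholder sizes:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("example")
      .set("spark.executor.memory", "4g")   // per-executor heap; must fit under SPARK_WORKER_MEMORY
    val sc = new SparkContext(conf)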

Spark 1.0.0 Standalone AppClient cannot connect Master

2014-06-12 Thread Hao Wang
Hi, all Why does the official Spark 1.0.0 doc no longer describe how to build Spark against a specific Hadoop version? Does that mean I don't need to specify the Hadoop version when I build my Spark 1.0.0 with `sbt/sbt assembly`? Regards, Wang Hao(王灏) CloudTeam | School of Software Engineering Shanghai

Re: Spark 1.0.0 Standalone AppClient cannot connect Master

2014-06-12 Thread Hao Wang
Wang Hao, This is not removed. We moved it here: http://spark.apache.org/docs/latest/hadoop-third-party-distributions.html If you're building with SBT, and you don't specify the SPARK_HADOOP_VERSION, then it defaults to 1.0.4. Andrew 2014-06-12 6:24 GMT-07:00 Hao Wang wh.s...@gmail.com
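For example, the build instructions from that era select the Hadoop version through an environment variable, along the lines of SPARK_HADOOP_VERSION=2.2.0 sbt/sbt assembly (the version number here is only an illustration); without it, the default of 1.0.4 mentioned above is used.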

Re: com.google.protobuf out of memory

2014-05-25 Thread Hao Wang
Hi, Zuhair In my experience, you could try the following steps to avoid Spark OOM: 1. Increase JVM memory by adding export SPARK_JAVA_OPTS=-Xmx2g 2. Use .persist(storage.StorageLevel.MEMORY_AND_DISK) instead of .cache() 3. Have you set the spark.executor.memory value? It's 512m by
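A compact sketch of point 2 above, assuming a spark-shell session (sc in scope), a placeholder input path, and 1.x-era defaults:

    import org.apache.spark.storage.StorageLevel

    // MEMORY_AND_DISK writes partitions that don't fit in memory to disk;
    // .cache() is shorthand for MEMORY_ONLY, which keeps only what fits and recomputes the rest.
    val data = sc.textFile("hdfs:///some/input")   // placeholder path
      .persist(StorageLevel.MEMORY_AND_DISK)
    println(data.count())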

For performance, Spark prefers OracleJDK or OpenJDK?

2014-05-19 Thread Hao Wang
Hi, Oracle JDK and OpenJDK, which one is better or preferred for Spark? Regards, Wang Hao(王灏) CloudTeam | School of Software Engineering Shanghai Jiao Tong University Address:800 Dongchuan Road, Minhang District, Shanghai, 200240 Email:wh.s...@gmail.com

java.lang.NoClassDefFoundError: org/apache/spark/deploy/worker/Worker

2014-05-18 Thread Hao Wang
Hi, all Spark version: bae07e3 [behind 1] fix different versions of commons-lang dependency and apache/spark#746 addendum I have six worker nodes, and four of them hit this NoClassDefFoundError when I use the start-slaves.sh script on my driver node. However, running ./bin/spark-class