The function for formatting addresses:
def formatAddress(address: String): String
Best regards,
Hao Wang
t.createDataFrame(rowrdd, yourNewSchema)
>
> Thanks!
>
> - Terry
>
> On Wed, Sep 16, 2015 at 2:10 AM, Hao Wang <billhao.l...@gmail.com> wrote:
>
Hi,
I created a dataframe with 4 string columns (city, state, country, zipcode).
I then applied the following nested schema to it by creating a custom
StructType. When I run df.take(5), it gives the exception below as expected.
The question is how I can convert the Rows in the dataframe to match the new nested schema.
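Following Terry's suggestion, here is a minimal sketch of reshaping the flat Rows and rebuilding the DataFrame with createDataFrame; the nested layout, field names, and sample data are illustrative assumptions, not the poster's actual StructType:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.types._

val sc = new SparkContext(new SparkConf().setAppName("NestedSchemaExample"))
val sqlContext = new SQLContext(sc)

// Stand-in for the original flat DataFrame with four string columns.
val df = sqlContext.createDataFrame(Seq(
  ("Shanghai", "Shanghai", "CN", "200240")
)).toDF("city", "state", "country", "zipcode")

// Illustrative nested schema: the four columns grouped under one struct field.
val nestedSchema = StructType(Seq(
  StructField("address", StructType(Seq(
    StructField("city", StringType),
    StructField("state", StringType),
    StructField("country", StringType),
    StructField("zipcode", StringType))))))

// Reshape each flat Row into a Row that matches the nested schema,
// then build a new DataFrame from the reshaped RDD and the schema.
val nestedRows = df.rdd.map(r =>
  Row(Row(r.getString(0), r.getString(1), r.getString(2), r.getString(3))))
val nestedDf = sqlContext.createDataFrame(nestedRows, nestedSchema)
nestedDf.take(5).foreach(println)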
On June 13, 2015, at 5:41 AM, Hao Wang bill...@gmail.com wrote:
Hi,
I have a bunch of large log files on Hadoop. Each line contains a log and its
severity. Is there a way that I can use Spark to split the entire data set by severity into separate outputs?
I am currently using filter inside a loop of all severity levels to do this,
which I think is pretty inefficient. It has to read the entire data set once
for each severity. I wonder if there is a more efficient way that takes just
one pass of the data? Thanks.
Best,
Hao Wang
Example input:
[ERROR] log1
[INFO] log2
[ERROR] log3
[INFO] log4
Output:
error_file
[ERROR] log1
[ERROR] log3
info_file
[INFO] log2
[INFO] log4
Best,
Hao Wang
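A one-pass alternative (a minimal sketch, not the poster's code): key each line by its severity and let a MultipleTextOutputFormat route records into per-severity directories, so the data is read only once. The paths and class names below are hypothetical:

import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat
import org.apache.spark.{SparkConf, SparkContext}

// Route each record to a directory named after its key (the severity level),
// writing only the value so the output lines stay unchanged.
class SeverityOutputFormat extends MultipleTextOutputFormat[Any, Any] {
  override def generateActualKey(key: Any, value: Any): Any = NullWritable.get()
  override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
    key.asInstanceOf[String] + "/" + name
}

object SplitBySeverity {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("SplitBySeverity"))
    sc.textFile("hdfs:///logs/input")                         // hypothetical input path
      .map(line => (line.drop(1).takeWhile(_ != ']'), line))  // key = "ERROR", "INFO", ...
      .saveAsHadoopFile("hdfs:///logs/by-severity",           // hypothetical output path
        classOf[String], classOf[String], classOf[SeverityOutputFormat])
    sc.stop()
  }
}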
Hi, all
Yes, it's the name of a Wikipedia article. I am running the WikipediaPageRank
example from Spark's Bagel package.
I am wondering whether there is any relation to the buffer size of Kryo.
The PageRank sometimes finishes successfully, and sometimes does not because this kind
of Kryo exception happens too many times, which
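If the exception is a Kryo buffer overflow, here is a minimal sketch of enlarging the serializer buffer in the driver program (the 64 MB value is illustrative; spark.kryoserializer.buffer.mb was the name of the setting in Spark 1.x):

import org.apache.spark.{SparkConf, SparkContext}

// Use Kryo and give it a larger serialization buffer than the small default.
val conf = new SparkConf()
  .setAppName("WikipediaPageRank")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryoserializer.buffer.mb", "64")
val sc = new SparkContext(conf)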
On Wed, Jul 16, 2014 at 1:41 PM, Tathagata Das tathagata.das1...@gmail.com
wrote:
Are you using classes from external libraries that have not been added to
the sparkContext, using sparkcontext.addJar()?
TD
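For reference, a minimal sketch of shipping an extra jar to the executors with addJar; the path is hypothetical:

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("WikipediaPageRank"))
// Distribute the dependency jar to every executor so its classes can be loaded at task time.
sc.addJar("/path/to/external-dependency.jar")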
On Tue, Jul 15, 2014 at 8:36 PM, Hao Wang
On Thu, Jul 17, 2014 at 2:58 AM, Tathagata Das tathagata.das1...@gmail.com
wrote:
Is the class that is not found in the wikipediapagerank jar?
TD
On Wed, Jul 16, 2014 at 12:32 AM, Hao Wang wh.s...@gmail.com wrote:
Thanks
I am running the WikipediaPageRank example in Spark and hit the same
problem as you:
14/07/16 11:31:06 DEBUG DAGScheduler: submitStage(Stage 6)
14/07/16 11:31:06 ERROR TaskSetManager: Task 6.0:450 failed 4 times; aborting job
14/07/16 11:31:06 INFO DAGScheduler: Failed to run foreach at
Hi, Wei
You may try setting JVM opts in *spark-env.sh* as follows to prevent or
mitigate GC pauses:
export SPARK_JAVA_OPTS="-XX:-UseGCOverheadLimit -XX:+UseConcMarkSweepGC -Xmx2g -XX:MaxPermSize=256m"
There are more options you could add, please just Google :)
Regards,
Wang Hao(王灏)
CloudTeam |
Hi, All
In Spark, spark.driver.host defaults to the driver's hostname, so the Akka
actor system will listen on a URL like akka.tcp://hostname:port. However,
when a user runs an application with spark-submit, the user may set
--master spark://192.168.1.12:7077.
Then, the *AppClient* in
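To illustrate the setting involved, a minimal sketch of pinning spark.driver.host to a reachable address instead of the default hostname (the IP is taken from the example above; the app name is hypothetical):

import org.apache.spark.{SparkConf, SparkContext}

// Bind the driver's endpoint to an address the master and executors can actually resolve.
val conf = new SparkConf()
  .setMaster("spark://192.168.1.12:7077")
  .setAppName("MyApp")
  .set("spark.driver.host", "192.168.1.12")
val sc = new SparkContext(conf)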
Hi, all
When I try to run Spark PageRank using:
./bin/spark-submit \
--master spark://192.168.1.12:7077 \
--class org.apache.spark.examples.bagel.WikipediaPageRank \
~/Documents/Scala/WikiPageRank/target/scala-2.10/wikipagerank_2.10-1.0.jar \
hdfs://192.168.1.12:9000/freebase-13G 0.05 100
Hi, Laurent
You could set spark.executor.memory and the heap size with the following methods:
1. in your conf/spark-env.sh:
*export SPARK_WORKER_MEMORY=38g*
*export SPARK_JAVA_OPTS="-XX:-UseGCOverheadLimit -XX:+UseConcMarkSweepGC -Xmx2g -XX:MaxPermSize=256m"*
2. you could also add modification for
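For the per-application route, a minimal sketch of setting spark.executor.memory programmatically (the 4g value and app name are illustrative):

import org.apache.spark.{SparkConf, SparkContext}

// Request 4 GB of heap per executor for this application only.
val conf = new SparkConf()
  .setAppName("MyApp")
  .set("spark.executor.memory", "4g")
val sc = new SparkContext(conf)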
Hi, all
Why did the Spark 1.0.0 official doc remove the section on how to build Spark
against a specific Hadoop version?
Does it mean that I don't need to specify the Hadoop version when I build
Spark 1.0.0 with `sbt/sbt assembly`?
Regards,
Wang Hao(王灏)
CloudTeam | School of Software Engineering
Shanghai
Wang Hao,
This is not removed. We moved it here:
http://spark.apache.org/docs/latest/hadoop-third-party-distributions.html
If you're building with SBT, and you don't specify the
SPARK_HADOOP_VERSION, then it defaults to 1.0.4.
Andrew
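For completeness, the SBT build against a specific Hadoop version looked like SPARK_HADOOP_VERSION=2.2.0 sbt/sbt assembly at the time; the 2.2.0 value here is just an example.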
2014-06-12 6:24 GMT-07:00 Hao Wang wh.s...@gmail.com
Hi, Zuhair
According to my experience, you could try the following steps to avoid Spark
OOM:
1. Increase JVM memory by adding export SPARK_JAVA_OPTS=-Xmx2g
2. Use .persist(storage.StorageLevel.MEMORY_AND_DISK) instead of .cache()
3. Have you set the spark.executor.memory value? It's 512m by default.
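A minimal sketch of item 2, with the import spelled out (the input path is hypothetical):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

val sc = new SparkContext(new SparkConf().setAppName("PersistExample"))
val rdd = sc.textFile("hdfs:///some/input")
// MEMORY_AND_DISK spills partitions that don't fit in memory to disk,
// instead of dropping and recomputing them as the default .cache() (MEMORY_ONLY) does.
rdd.persist(StorageLevel.MEMORY_AND_DISK)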
Hi,
Oracle JDK and OpenJDK, which one is better or preferred for Spark?
Regards,
Wang Hao(王灏)
CloudTeam | School of Software Engineering
Shanghai Jiao Tong University
Address:800 Dongchuan Road, Minhang District, Shanghai, 200240
Email:wh.s...@gmail.com
Hi, all
*Spark version: bae07e3 [behind 1] fix different versions of commons-lang
dependency and apache/spark#746 addendum*
I have six worker nodes and four of them hit this NoClassDefFoundError when
I use the start-slaves.sh script on my driver node. However, running ./bin/spark-class