>>> which you can set in your Spark context by prefixing them with spark.hadoop.
>>>
>>> You can also set the env vars AWS_ACCESS_KEY_ID and
>>> AWS_SECRET_ACCESS_KEY; SparkEnv will pick these up and set the relevant
>>> Spark context options for you.
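A minimal sketch of the spark.hadoop. prefix approach described above (the fs.s3a property names are the standard hadoop-aws ones; the app name and explicit env-var lookup are illustrative):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Sketch: forward S3 credentials to the Hadoop configuration via the
// spark.hadoop. prefix on SparkConf keys.
val conf = new SparkConf()
  .setAppName("s3a-credentials-example")
  .set("spark.hadoop.fs.s3a.access.key", sys.env("AWS_ACCESS_KEY_ID"))
  .set("spark.hadoop.fs.s3a.secret.key", sys.env("AWS_SECRET_ACCESS_KEY"))
val sc = new SparkContext(conf)
```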
Hi All,
I am using Eclipse with Maven for developing Spark applications. I got an
error when reading from S3 in Scala, but it works fine in Java when I run
them in the same project in Eclipse. The Scala/Java code and the error are
as follows.
Scala
val uri = URI.create("s3a://" + key + ":" + seckey
GraphX does not support Python yet.
http://spark.apache.org/docs/latest/graphx-programming-guide.html
The workaround is to use GraphFrames (a third-party API),
https://issues.apache.org/jira/browse/SPARK-3789
but some features in Python are not the same as in Scala.
Hi All,
I want to read MySQL from Spark. Please let me know how many threads will be
used to read the RDBMS after setting numPartitions = 10 in Spark JDBC. What is
the best practice for reading a large dataset from an RDBMS into Spark?
Thanks,
Jingyu
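A sketch of the partitioned JDBC read being asked about (URL, table, and bounds are hypothetical). With a partition column and numPartitions = 10, Spark opens up to 10 parallel JDBC connections, one per partition:

```scala
// Sketch only: connection details and table/column names are made up.
val jdbcDF = sqlContext.read.format("jdbc")
  .option("url", "jdbc:mysql://dbhost:3306/mydb")
  .option("dbtable", "events")
  .option("partitionColumn", "id") // must be a numeric column
  .option("lowerBound", "1")
  .option("upperBound", "1000000")
  .option("numPartitions", "10")   // up to 10 concurrent reads
  .load()
```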
--
This message and its attachments may contain legally privileged or
confidential information. It is intended solely for the named addressee.
You may set "maximizeResourceAllocation"; then EMR will do the best config for
you.
http://docs.aws.amazon.com/ElasticMapReduce/latest/ReleaseGuide/emr-spark-configure.html
Jingyu
On 18 February 2016 at 12:02, wrote:
> Hi All,
>
> I have been facing memory issues in
I am running a Pregel function with 37 nodes on EMR Hadoop. After an hour,
the logs show the following. Can anyone please tell me what the problem is and
how I can solve it? Thanks
16/02/05 14:02:46 WARN BlockManagerMaster: Failed to remove broadcast
2 with removeFromMaster = true - Cannot receive any reply in
Hi, what is the best way to unpersist an RDD in GraphX to release memory?
RDD.unpersist
or
Graph.unpersistVertices and graph.edges.unpersist
I studied the source code of Pregel.scala; both of the above are used between
line 148 and line 150. Can anyone please tell me the difference? In
addition, what
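The pattern in Pregel.scala can be sketched roughly as follows (variable names are illustrative): unpersistVertices drops the cached vertex attributes of the old graph, while edges.unpersist drops its cached edge partitions.

```scala
// Sketch of the Pregel-style cleanup: after the new graph for an iteration
// is materialized, release the previous iteration's cached vertices and edges.
val newGraph = oldGraph.joinVertices(messages)(vprog).cache()
newGraph.edges.count() // force materialization before dropping the old graph
oldGraph.unpersistVertices(blocking = false)
oldGraph.edges.unpersist(blocking = false)
```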
I am trying to filter out vertices that do not have any links to
others. How can I filter those isolated vertices in GraphX?
Thanks,
Jingyu
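One way to do this (a sketch, assuming an existing Graph[VD, ED] named graph) is to join in vertex degrees, which only exist for vertices with at least one edge, and then take a subgraph:

```scala
// Sketch: graph.degrees contains no entry for isolated vertices, so they
// get degree 0 via getOrElse and are dropped by the subgraph predicate.
val withDeg = graph.outerJoinVertices(graph.degrees) {
  (id, attr, degOpt) => (attr, degOpt.getOrElse(0))
}
val connectedOnly = withDeg
  .subgraph(vpred = (id, attr) => attr._2 > 0)
  .mapVertices { case (_, (attr, _)) => attr } // restore original attributes
```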
Can anyone please let me know how to print all nodes in each connected
component, one by one?
graph.connectedComponents()
e.g.
Connected component ID    Node IDs
1                         1, 2, 3
6                         6, 7, 8, 9
Thanks
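A sketch of grouping vertex IDs by their component ID (connectedComponents tags each vertex with the smallest vertex ID in its component; collect is only safe for small graphs):

```scala
// cc is an RDD of (vertexId, componentId) pairs.
val cc = graph.connectedComponents().vertices
cc.map { case (vid, comp) => (comp, vid) }
  .groupByKey()
  .collect()
  .foreach { case (comp, vids) =>
    println(s"$comp\t${vids.toSeq.sorted.mkString(",")}")
  }
```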
I am using spark-csv to save files to S3, and it shows "Size exceeds".
Please let me know how to fix it. Thanks.
df.write()
.format("com.databricks.spark.csv")
.option("header", "true")
.save("s3://newcars.csv");
java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
at
org.apache.spark.sql.execution.Sort$$anonfun$doExecute$5.apply(basicOperators.scala:190)
at
org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:48)
... 21 more
On 16 November 2015 at 21:16, Zhang, Jingyu <jingyu.zh...@news.com.au>
wrote:
> I am using spark-csv to save
fengdo...@everstring.com> wrote:
> If you get a "cannot serialize" exception, then you need to make
> PixelGenerator a static class.
>
>
>
>
> On Nov 16, 2015, at 1:10 PM, Zhang, Jingyu <jingyu.zh...@news.com.au>
> wrote:
>
> Thanks, that worked for the local environment
I want to pass two parameters into a new Java class from rdd.mapPartitions();
the code is like the following.
---Source Code
Main method:
/* The parameters that I want to pass into PixelGenerator.class for
   selecting items between the startTime and the endTime. */
int startTime, endTime;
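A sketch of the constructor-parameter pattern discussed in this thread (written in Scala here; the class name mirrors the thread's PixelGenerator but the fields and logic are illustrative). The key point is that the object holding the parameters must be Serializable so it can be shipped to executors:

```scala
// Sketch: carry startTime/endTime in a Serializable class instead of
// capturing non-serializable outer state in the closure.
class PixelGenerator(startTime: Int, endTime: Int) extends Serializable {
  def call(iter: Iterator[(Int, String)]): Iterator[(Int, String)] =
    iter.filter { case (ts, _) => ts >= startTime && ts <= endTime }
}

val gen = new PixelGenerator(10, 20)
val selected = rdd.mapPartitions(gen.call) // only gen's fields are serialized
```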
Thanks, that worked for the local environment but not in the Spark cluster.
On 16 November 2015 at 16:05, Fengdong Yu <fengdo...@everstring.com> wrote:
> Can you try : new PixelGenerator(startTime, endTime) ?
>
>
>
> On Nov 16, 2015, at 12:47 PM, Zhang, Jingyu <jingyu.zh
I have a small CSV file in S3. I use s3a://key:seckey@bucketname/a.csv
It works for SparkContext:
val pixelsStr: RDD[String] = ctx.textFile(s3pathOrg)
It works for Java spark-csv as well.
Java code: DataFrame careerOneDF = sqlContext.read().format(
"com.databricks.spark.csv")
There is no problem in Spark SQL 1.5.1, but the error "key not found:
sportingpulse.com" shows up when I use 1.5.0.
I have to use version 1.5.0 because that is the one AWS EMR supports.
Can anyone tell me why Spark uses "sportingpulse.com" and how to fix it?
Thanks.
Caused by:
com>
> Sent: 10/30/2015 6:33 PM
> To: Zhang, Jingyu <jingyu.zh...@news.com.au>
> Cc: user <user@spark.apache.org>
> Subject: Re: key not found: sportingpulse.com in Spark SQL 1.5.0
>
> I searched for sportingpulse in *.scala and *.java files under 1.5 branch.
>
Try s3://aws_key:aws_secret@bucketName/folderName with your access key and
secret to save the data.
On 30 October 2015 at 10:55, William Li wrote:
> Hi – I have a simple app running fine with Spark; it reads data from S3
> and performs calculations.
>
> When reading data from
me with just a single row?
> Do your rows have any columns with null values?
> Can you post a code snippet here on how you load/generate the dataframe?
> Does dataframe.rdd.cache work?
>
> *Romi Kuntsman*, *Big Data Engineer*
> http://www.totango.com
>
> On Thu,
It is not a problem to use JavaRDD.cache() for 200 MB of data (all objects read
from JSON format). But when I try to use DataFrame.cache(), it throws the
exception below.
My machine can cache 1 GB of data in Avro format without any problem.
15/10/29 13:26:23 INFO GeneratePredicate: Code generated in
I got the following exception when I run
JavaPairRDD.values().saveAsTextFile("s3n://bucket");
Can anyone help me out? Thanks.
15/09/25 12:24:32 INFO SparkContext: Successfully stopped SparkContext
Exception in thread "main" java.lang.NoClassDefFoundError:
org/jets3t/service/ServiceException
at
Hi All,
I want to extract the "value" RDD from a PairRDD in Java.
Please let me know how I can get it easily.
Thanks
Jingyu
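In the Java API this is JavaPairRDD.values() (it appears later in this thread); the Scala equivalent is a one-liner, sketched here with made-up data:

```scala
// pairs is an RDD of (key, value); .values keeps only the value side.
val pairs = sc.parallelize(Seq((1, "a"), (2, "b"), (3, "c")))
val values = pairs.values // RDD[String] containing "a", "b", "c"
```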
I have DataFrames A and B.
A has columns a11, a12, a21, a22.
B has columns b11, b12, b21, b22.
I persist them in the cache:
1. A.cache()
2. B.cache()
Then, I persist the subsets in the cache later:
3. DataFrame A1 (a11, a12).cache()
4. DataFrame B1 (b11, b12).cache()
5. DataFrame AB1
Does your SQL work if it does not run a regex on strings and concatenate them, I
mean just select the data without string operations?
On 24 September 2015 at 10:11, java8964 wrote:
> Try to increase the partition count; that will make each partition hold less
> data.
>
> Yong
>
>
the data that they have. So for your A1 and B1
>> you would need extra memory that would be equivalent to half the memory of
>> A/B.
>>
>> You can check the storage that a dataFrame is consuming in the Spark UI's
>> Storage tab. http://host:4040/storage/
>>
>>
Spark spills data to disk when there is more data shuffled onto a single
executor machine than can fit in memory. However, it flushes out the data
to disk one key at a time - so if a single key has more key-value pairs
than can fit in memory, an out of memory exception occurs.
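A sketch of the usual mitigation for the hot-key case described above: combine values map-side with reduceByKey (rather than collecting all of a key's values with groupByKey) and raise the shuffle partition count (the number is illustrative):

```scala
// Sketch: reduceByKey combines values per key before the shuffle, so no
// single key's full value list has to fit in memory on one executor.
val counts = words.map(w => (w, 1))
  .reduceByKey(_ + _, numPartitions = 200)
```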
Cheers,
Jingyu
On