Re: Error from reading S3 in Scala

2016-05-04 Thread Zhang, Jingyu
oop-aws/index.html, which you can set in your Spark context by prefixing them with spark.hadoop. You can also set the env vars AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY; SparkEnv will pick these up and set the relevant Spark conte
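
A minimal sketch of the two approaches described in this reply (the property names are the hadoop-aws ones; the app name and bucket path are placeholders):

    import org.apache.spark.{SparkConf, SparkContext}

    // Option 1: hadoop-aws properties prefixed with spark.hadoop.
    val conf = new SparkConf()
      .setAppName("s3a-read")
      .set("spark.hadoop.fs.s3a.access.key", sys.env.getOrElse("AWS_ACCESS_KEY_ID", ""))
      .set("spark.hadoop.fs.s3a.secret.key", sys.env.getOrElse("AWS_SECRET_ACCESS_KEY", ""))
    val sc = new SparkContext(conf)

    // Option 2: leave the properties unset and just export AWS_ACCESS_KEY_ID
    // and AWS_SECRET_ACCESS_KEY in the environment; SparkEnv picks them up.
    val lines = sc.textFile("s3a://my-bucket/path/data.csv")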

Error from reading S3 in Scala

2016-05-02 Thread Zhang, Jingyu
Hi All, I am using Eclipse with Maven for developing Spark applications. I got an error reading from S3 in Scala, but it works fine in Java when I run them in the same project in Eclipse. The Scala/Java code and the error are as follows. Scala: val uri = URI.create("s3a://" + key + ":" + seckey
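
A sketch of the URI pattern from this thread, with the secret URL-encoded: secrets containing characters such as "/" or "+" can break the URI and are a common reason a read fails in one language binding but not the other (key, seckey, and the bucket are placeholders):

    import java.net.URLEncoder
    import org.apache.spark.SparkContext

    def s3aPath(key: String, seckey: String, bucket: String, file: String): String = {
      // Encode the secret so "/" or "+" inside it cannot corrupt the URI.
      val enc = URLEncoder.encode(seckey, "UTF-8")
      s"s3a://$key:$enc@$bucket/$file"
    }

    // usage: sc.textFile(s3aPath(key, seckey, "my-bucket", "a.csv"))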

Re: Scala vs Python for Spark ecosystem

2016-04-20 Thread Zhang, Jingyu
GraphX does not support Python yet. http://spark.apache.org/docs/latest/graphx-programming-guide.html A workaround is to use graphframes (a 3rd-party API), https://issues.apache.org/jira/browse/SPARK-3789 but some features in Python are not the same as in Scala,
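
For reference, a minimal Scala sketch of the graphframes workaround (assumes the graphframes package is on the classpath; the column names "id", "src", "dst" are what GraphFrame expects, and the data is illustrative):

    import org.apache.spark.sql.SQLContext
    import org.graphframes.GraphFrame

    def buildGraph(sqlContext: SQLContext): GraphFrame = {
      import sqlContext.implicits._
      val vertices = Seq((1L, "a"), (2L, "b"), (3L, "c")).toDF("id", "name")
      val edges    = Seq((1L, 2L), (2L, 3L)).toDF("src", "dst")
      GraphFrame(vertices, edges)   // usable from both Python and Scala
    }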

How many threads will be used to read RDBMS after set numPartitions =10 in Spark JDBC

2016-04-04 Thread Zhang, Jingyu
Hi All, I want to read MySQL from Spark. Please let me know how many threads will be used to read the RDBMS after setting numPartitions = 10 in Spark JDBC. What is the best practice for reading a large dataset from an RDBMS into Spark? Thanks, Jingyu -- This message and its attachments may contain legally
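
To the question itself: with numPartitions = 10, Spark creates 10 partitions over the given column range, and each partition opens its own JDBC connection when its task runs, so at most 10 concurrent reads (fewer if fewer executor cores are free). A sketch of the overload that takes the partitioning arguments (URL, table, column, bounds, and credentials are all placeholders; the MySQL JDBC driver must be on the classpath):

    import java.util.Properties
    import org.apache.spark.sql.SQLContext

    def readOrders(sqlContext: SQLContext) = {
      val props = new Properties()
      props.setProperty("user", "reader")        // placeholder credentials
      props.setProperty("password", "secret")
      sqlContext.read.jdbc(
        "jdbc:mysql://db-host:3306/mydb",        // placeholder URL
        "orders",                                // placeholder table
        "id",          // numeric column whose range is split into partitions
        1L,            // lowerBound
        1000000L,      // upperBound
        10,            // numPartitions: 10 slices, one connection each
        props)
    }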

Re: Memory issues on spark

2016-02-17 Thread Zhang, Jingyu
May set "maximizeResourceAllocation", then EMR will do the best config for you. http://docs.aws.amazon.com/ElasticMapReduce/latest/ReleaseGuide/emr-spark-configure.html Jingyu On 18 February 2016 at 12:02, wrote: > Hi All, > > I have been facing memory issues in

Failed to remove broadcast 2 with removeFromMaster = true in Graphx

2016-02-05 Thread Zhang, Jingyu
I am running a Pregel function with 37 nodes on EMR Hadoop. After an hour the logs show the following. Can anyone please tell me what the problem is and how to solve it? Thanks 16/02/05 14:02:46 WARN BlockManagerMaster: Failed to remove broadcast 2 with removeFromMaster = true - Cannot receive any reply in

Unpersist RDD in Graphx

2016-01-31 Thread Zhang, Jingyu
Hi, What is the best way to unpersist an RDD in GraphX to release memory: RDD.unpersist, or RDD.unpersistVertices and RDD.edges.unpersist? I studied the source code of Pregel.scala; both of the above are used between line 148 and line 150. Can anyone please tell me what the difference is? In addition, what
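
For context, a GraphX Graph is backed by a vertex RDD and an edge RDD, which is why Pregel.scala releases both sides when it discards the previous iteration's graph. A minimal sketch of that pattern:

    import org.apache.spark.graphx.Graph

    def releaseGraph[VD, ED](prevG: Graph[VD, ED]): Unit = {
      // Release both halves of the graph; blocking = false makes the
      // calls asynchronous, as Pregel.scala does.
      prevG.unpersistVertices(blocking = false)
      prevG.edges.unpersist(blocking = false)
    }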

How to filter the isolated vertexes in Graphx

2016-01-28 Thread Zhang, Jingyu
I am trying to filter out vertices that do not have any links to others. How can I filter those isolated vertices in GraphX? Thanks, Jingyu -- This message and its attachments may contain legally privileged or confidential information. It is intended solely for the named addressee. If you
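
One way to do this (a sketch with a toy graph): graph.degrees only lists vertices with at least one edge, so join the degrees back in and keep the vertices that found an entry.

    import org.apache.spark.SparkContext
    import org.apache.spark.graphx.{Edge, Graph}

    def dropIsolated(sc: SparkContext): Unit = {
      // Toy graph: vertex 4 has no edges at all.
      val vertices = sc.parallelize(Seq((1L, "a"), (2L, "b"), (3L, "c"), (4L, "isolated")))
      val edges    = sc.parallelize(Seq(Edge(1L, 2L, 1), Edge(2L, 3L, 1)))
      val graph    = Graph(vertices, edges)

      // Isolated vertices get no entry in graph.degrees, hence degree 0 here.
      val withDeg = graph.outerJoinVertices(graph.degrees) {
        (_, attr, deg) => (attr, deg.getOrElse(0))
      }
      val connected = withDeg.subgraph(vpred = (_, v) => v._2 > 0)
      connected.vertices.collect().foreach(println)   // vertex 4 is gone
    }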

Graphx: How to print the group of connected components one by one

2015-12-01 Thread Zhang, Jingyu
Can anyone please let me know how to print all the nodes in each connected component, one component per line? graph.connectedComponents() e.g. component ID 1 -> node IDs 1,2,3; component ID 6 -> node IDs 6,7,8,9 Thanks -- This
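
A sketch of one way to get that layout: connectedComponents() labels each vertex with the smallest vertex id in its component, so grouping on that label gives one row per component.

    import scala.reflect.ClassTag
    import org.apache.spark.graphx.Graph

    def printComponents[VD: ClassTag, ED: ClassTag](graph: Graph[VD, ED]): Unit = {
      graph.connectedComponents().vertices   // pairs of (vertexId, componentId)
        .map { case (vid, cc) => (cc, vid) }
        .groupByKey()
        .collect()
        .foreach { case (cc, ids) => println(s"$cc\t${ids.toSeq.sorted.mkString(",")}") }
    }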

Size exceeds Integer.MAX_VALUE on EMR 4.0.0 Spark 1.4.1

2015-11-16 Thread Zhang, Jingyu
I am using spark-csv to save files to S3 and it shows "Size exceeds Integer.MAX_VALUE". Please let me know how to fix it. Thanks. df.write() .format("com.databricks.spark.csv") .option("header", "true") .save("s3://newcars.csv"); java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE at
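
A common workaround (a sketch, not a guaranteed fix): no single Spark block may exceed 2 GB (Integer.MAX_VALUE bytes), so raising the partition count before writing shrinks each block; 200 is an illustrative number, not a tuned one.

    import org.apache.spark.sql.DataFrame

    def saveCsv(df: DataFrame, path: String): Unit = {
      df.repartition(200)   // more partitions => smaller blocks per partition
        .write
        .format("com.databricks.spark.csv")
        .option("header", "true")
        .save(path)
    }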

Re: Size exceeds Integer.MAX_VALUE (SparkSQL$TreeNodeException: sort, tree) on EMR 4.0.0 Spark 1.4.1

2015-11-16 Thread Zhang, Jingyu
rk.sql.execution.Sort$$anonfun$doExecute$5.apply(basicOperators.scala:190) at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:48) ... 21 more On 16 November 2015 at 21:16, Zhang, Jingyu <jingyu.zh...@news.com.au> wrote: > I am using spark-csv to save

Re: How to passing parameters to another java class

2015-11-15 Thread Zhang, Jingyu
fengdo...@everstring.com> wrote: > If you got “cannot Serialized” Exception, then you need to > PixelGenerator as a Static class. > > > > > On Nov 16, 2015, at 1:10 PM, Zhang, Jingyu <jingyu.zh...@news.com.au> > wrote: > > Thanks, that worked for local environm

How to passing parameters to another java class

2015-11-15 Thread Zhang, Jingyu
I want to pass two parameters into a new Java class from rdd.mapPartitions(); the code is as follows. ---Source Code Main method: /* the parameters that I want to pass into the PixelGenerator class for selecting any items between the startTime and the endTime. */ int startTime, endTime;
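
A Scala sketch of the approach the replies converge on: hold the two parameters in a Serializable class constructed on the driver and use it inside mapPartitions. PixelGenerator's internals here are invented for illustration; only the name comes from the thread.

    import org.apache.spark.rdd.RDD

    class PixelGenerator(startTime: Int, endTime: Int) extends Serializable {
      def inRange(t: Int): Boolean = t >= startTime && t <= endTime
    }

    def selectByTime(rdd: RDD[Int], startTime: Int, endTime: Int): RDD[Int] = {
      val gen = new PixelGenerator(startTime, endTime)  // shipped to executors
      rdd.mapPartitions(_.filter(gen.inRange))
    }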

Re: How to passing parameters to another java class

2015-11-15 Thread Zhang, Jingyu
Thanks, that worked for the local environment but not on the Spark cluster. On 16 November 2015 at 16:05, Fengdong Yu <fengdo...@everstring.com> wrote: > Can you try: new PixelGenerator(startTime, endTime)? > > > > On Nov 16, 2015, at 12:47 PM, Zhang, Jingyu <jingyu.zh

Spark-csv error on read AWS s3a in spark 1.4.1

2015-11-10 Thread Zhang, Jingyu
I have a small csv file in S3 and use s3a://key:seckey@bucketname/a.csv It works with SparkContext: pixelsStr = ctx.textFile(s3pathOrg); It works with Java spark-csv as well. Java code: DataFrame careerOneDF = sqlContext.read().format( "com.databricks.spark.csv")
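
For comparison, the Scala equivalent of the Java read that works (a sketch; assumes the spark-csv package is on the classpath and the path is a placeholder):

    import org.apache.spark.sql.SQLContext

    def readCsv(sqlContext: SQLContext, s3Path: String) =
      sqlContext.read
        .format("com.databricks.spark.csv")
        .option("header", "true")
        .load(s3Path)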

key not found: sportingpulse.com in Spark SQL 1.5.0

2015-10-30 Thread Zhang, Jingyu
There is no problem in Spark SQL 1.5.1, but the error "key not found: sportingpulse.com" shows up when I use 1.5.0. I have to use 1.5.0 because that is the version AWS EMR supports. Can anyone tell me why Spark uses "sportingpulse.com" and how to fix it? Thanks. Caused by:

Re: key not found: sportingpulse.com in Spark SQL 1.5.0

2015-10-30 Thread Zhang, Jingyu
com> > Sent: ‎10/‎30/‎2015 6:33 PM > To: Zhang, Jingyu <jingyu.zh...@news.com.au> > Cc: user <user@spark.apache.org> > Subject: Re: key not found: sportingpulse.com in Spark SQL 1.5.0 > > I searched for sportingpulse in *.scala and *.java files under 1.5 branch. >

Re: Save data to different S3

2015-10-29 Thread Zhang, Jingyu
Try s3://aws_key:aws_secret@bucketName/folderName with your access key and secret to save the data. On 30 October 2015 at 10:55, William Li wrote: > Hi – I have a simple app running fine with Spark; it reads data from S3 > and performs calculations. > > When reading data from
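
A minimal sketch of that suggestion (all names are placeholders; a secret containing "/" usually needs URL-encoding before being embedded in the path):

    import org.apache.spark.rdd.RDD

    def saveWithCreds(rdd: RDD[String], key: String, secret: String): Unit =
      rdd.saveAsTextFile(s"s3://$key:$secret@bucketName/folderName")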

Re: NullPointerException when cache DataFrame in Java (Spark1.5.1)

2015-10-29 Thread Zhang, Jingyu
me with just a single row? > Do your rows have any columns with null values? > Can you post a code snippet here of how you load/generate the dataframe? > Does dataframe.rdd.cache work? > > *Romi Kuntsman*, *Big Data Engineer* > http://www.totango.com > > On Thu,

NullPointerException when cache DataFrame in Java (Spark1.5.1)

2015-10-28 Thread Zhang, Jingyu
It is not a problem to use JavaRDD.cache() for 200M of data (all objects read from JSON format). But when I try to use DataFrame.cache(), it throws the exception below. My machine can cache 1 G of data in Avro format without any problem. 15/10/29 13:26:23 INFO GeneratePredicate: Code generated in

Exception on save s3n file (1.4.1, hadoop 2.6)

2015-09-24 Thread Zhang, Jingyu
I get the following exception when I run JavaPairRDD.values().saveAsTextFile("s3n://bucket"); Can anyone help me out? Thanks. 15/09/25 12:24:32 INFO SparkContext: Successfully stopped SparkContext Exception in thread "main" java.lang.NoClassDefFoundError: org/jets3t/service/ServiceException at
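
The missing class lives in the jets3t library, which the s3n:// filesystem depends on. One way to put it on the classpath at submit time (the coordinates and version are an assumption; match them to your Hadoop build):

    spark-submit --packages net.java.dev.jets3t:jets3t:0.9.0 ... your-app.jar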

How to get RDD from PairRDD<key,value> in Java

2015-09-23 Thread Zhang, Jingyu
Hi All, I want to extract the "value" RDD from a PairRDD in Java. Please let me know how I can get it easily. Thanks, Jingyu -- This message and its attachments may contain legally privileged or confidential information. It is intended solely for the named addressee. If you are not
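
The Java API has this directly as JavaPairRDD.values(), which returns a JavaRDD of the values; a minimal Scala sketch of the same operation:

    import org.apache.spark.SparkContext

    def demo(sc: SparkContext): Unit = {
      val pairs  = sc.parallelize(Seq(("a", 1), ("b", 2)))
      val values = pairs.values           // RDD[Int], via PairRDDFunctions
      values.collect().foreach(println)
    }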

caching DataFrames

2015-09-23 Thread Zhang, Jingyu
I have two DataFrames, A and B. A has columns a11, a12, a21, a22; B has columns b11, b12, b21, b22. I persist them in cache: 1. A.cache(), 2. B.cache() Then I persist the subsets in cache later: 3. DataFrame A1 (a11,a12).cache() 4. DataFrame B1 (b11,b12).cache() 5. DataFrame AB1
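
A sketch of the sequence being described (column names from the message; how AB1 combines A1 and B1 is not shown in the thread, so the union is an assumption). Note that each cache() call materializes its own copy, so A1 and B1 occupy storage in addition to A and B.

    import org.apache.spark.sql.DataFrame

    def cacheSubsets(a: DataFrame, b: DataFrame): DataFrame = {
      a.cache(); b.cache()                              // steps 1 and 2
      val a1  = a.select("a11", "a12").cache()          // step 3
      val b1  = b.select("b11", "b12").cache()          // step 4
      val ab1 = a1.unionAll(b1).cache()                 // step 5 (assumed union)
      ab1
    }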

Re: Java Heap Space Error

2015-09-23 Thread Zhang, Jingyu
Does your SQL work if it does not run a regex on strings and concatenate them, I mean just selects the stuff without string operations? On 24 September 2015 at 10:11, java8964 wrote: > Try to increase the partition count; that will make each partition hold less > data. > > Yong > >

Re: caching DataFrames

2015-09-23 Thread Zhang, Jingyu
the data that they have. So for your A1 and B1 >> you would need extra memory that would be equivalent to half the memory of >> A/B. >> >> You can check the storage that a dataFrame is consuming in the Spark UI's >> Storage tab. http://host:4040/storage/ >> >>

Re: word count (group by users) in spark

2015-09-21 Thread Zhang, Jingyu
Spark spills data to disk when more data is shuffled onto a single executor machine than can fit in memory. However, it flushes the data to disk one key at a time, so if a single key has more key-value pairs than can fit in memory, an out-of-memory exception occurs. Cheers, Jingyu On
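
A sketch of how that per-key limit is usually avoided in a word count: reduceByKey combines map-side and keeps only a running count per key, so even a heavily skewed user/word never has to fit in memory as a full value list (the (user, text) input layout is an assumption):

    import org.apache.spark.rdd.RDD

    def wordCountPerUser(lines: RDD[(String, String)]): RDD[((String, String), Int)] =
      lines
        .flatMap { case (user, text) => text.split("\\s+").map(w => ((user, w), 1)) }
        .reduceByKey(_ + _)   // partial counts are merged, never full value lists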