Re: Differences between Spark APIs for Hadoop 1.x and Hadoop 2.x in terms of performance, progress reporting and IO metrics.

2015-12-09 Thread Fengdong Yu
I don’t think there is a performance difference between the 1.x API and the 2.x API, but it’s not a big issue for your change; only com.databricks.hadoop.mapreduce.lib.input.XmlInputFormat.java is involved.

Re: getting error while persisting in hive

2015-12-09 Thread Fengdong Yu
.write, not .write() > On Dec 9, 2015, at 5:37 PM, Divya Gehlot wrote: > > Hi, > I am using spark 1.4.1. > I am getting an error when persisting spark dataframe output to hive > scala> >

Re: Filtering records based on empty value of column in SparkSql

2015-12-09 Thread Fengdong Yu
val req_logs_with_dpid = req_logs.filter(req_logs("req_info.pid") != "" ) Azuryy Yu Sr. Infrastructure Engineer cell: 158-0164-9103 WeChat: azuryy On Wed, Dec 9, 2015 at 7:43 PM, Prashant Bhardwaj < prashant2006s...@gmail.com> wrote: > Hi > > I have two columns in my json which can have null,
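
A minimal sketch of that filter, not taken verbatim from the thread: the DataFrame and column names follow the reply, the input path is a placeholder, and it uses the Spark 1.x Column operators isNotNull and !== (Scala's plain != would compare the Column object itself rather than the values):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("filter-empty-columns"))
val sqlContext = new SQLContext(sc)

// Placeholder input; the thread's JSON has a nested req_info.pid field.
val req_logs = sqlContext.read.json("/path/to/req_logs.json")

// Keep only rows whose req_info.pid is neither null nor an empty string.
val req_logs_with_pid = req_logs.filter(
  req_logs("req_info.pid").isNotNull && (req_logs("req_info.pid") !== ""))

req_logs_with_pid.show()
```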

Re: Spark with MapDB

2015-12-08 Thread Fengdong Yu
> > <https://in.linkedin.com/in/ramkumarcs31> > > > On Tue, Dec 8, 2015 at 1:42 PM, Fengdong Yu <fengdo...@everstring.com > <mailto:fengdo...@everstring.com>> wrote: > Can you detail your question? what looks like your previous batch and the > current bat

Re: Spark with MapDB

2015-12-08 Thread Fengdong Yu
Can you detail your question? What do your previous batch and the current batch look like? > On Dec 8, 2015, at 3:52 PM, Ramkumar V wrote: > > Hi, > > I'm running java over spark in cluster mode. I want to apply a filter on > javaRDD based on some previous batch

Re: About Spark On Hbase

2015-12-08 Thread Fengdong Yu
https://github.com/nerdammer/spark-hbase-connector This is better and easier to use. > On Dec 9, 2015, at 3:04 PM, censj wrote: > > hi all, > I am now using Spark, but I have not found an open-source project for operating on HBase from Spark. Can > anyone point me to one? >

Re: NoSuchMethodError: com.fasterxml.jackson.databind.ObjectMapper.enable

2015-12-07 Thread Fengdong Yu
Can you try something like this in your sbt build: val spark_version = "1.5.2" val excludeServletApi = ExclusionRule(organization = "javax.servlet", artifact = "servlet-api") val excludeEclipseJetty = ExclusionRule(organization = "org.eclipse.jetty") libraryDependencies ++= Seq( "org.apache.spark" %%
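
Filled out a little, the suggested build.sbt fragment might look like the following; the module list is illustrative rather than taken from the original message:

```scala
// build.sbt sketch: pin the Spark version and exclude servlet-api / Eclipse Jetty
// from the Spark dependencies, as the reply suggests.
val spark_version = "1.5.2"

val excludeServletApi   = ExclusionRule(organization = "javax.servlet", artifact = "servlet-api")
val excludeEclipseJetty = ExclusionRule(organization = "org.eclipse.jetty")

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % spark_version excludeAll (excludeServletApi, excludeEclipseJetty),
  "org.apache.spark" %% "spark-sql"  % spark_version excludeAll (excludeServletApi, excludeEclipseJetty)
)
```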

Re: persist spark output in hive using DataFrame and saveAsTable API

2015-12-07 Thread Fengdong Yu
If your RDD is in JSON format, that’s easy. val df = sqlContext.read.json(rdd) df.saveAsTable("your_table_name") > On Dec 7, 2015, at 5:28 PM, Divya Gehlot wrote: > > Hi, > I am a newbie to Spark. > Could somebody guide me on how I can persist my spark RDD results in Hive
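
Spelled out a bit more, and assuming the RDD holds one JSON document per element, the suggestion might look like this; paths and the table name are placeholders, and a HiveContext is used so the table lands in Hive:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext(new SparkConf().setAppName("persist-json-to-hive"))
val hiveContext = new HiveContext(sc)

// Placeholder input: an RDD[String] where each element is a JSON record.
val rdd = sc.textFile("/path/to/json/lines")

// Infer the schema from the JSON, then persist it as a Hive table
// via the DataFrameWriter (Spark 1.4+ style).
val df = hiveContext.read.json(rdd)
df.write.saveAsTable("your_table_name")
```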

Re: persist spark output in hive using DataFrame and saveAsTable API

2015-12-07 Thread Fengdong Yu
I suppose your output data is ORC and you want to save it to the Hive database test, with external table name testTable: import scala.collection.immutable sqlContext.createExternalTable("test.testTable", "org.apache.spark.sql.hive.orc", Map("path" -> "/data/test/mydata")) > On Dec 7, 2015, at 5:28
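
Keeping the names from the reply (database test, table testTable, ORC files under /data/test/mydata), a fuller sketch of that call could be:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext(new SparkConf().setAppName("external-orc-table"))
val hiveContext = new HiveContext(sc)

// Register already-written ORC files as an external Hive table without copying data.
// The data source string is the one used in the reply.
hiveContext.createExternalTable(
  "test.testTable",
  "org.apache.spark.sql.hive.orc",
  Map("path" -> "/data/test/mydata"))
```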

Re: python rdd.partionBy(): any examples of a custom partitioner?

2015-12-07 Thread Fengdong Yu
Refer to https://www.safaribooksonline.com/library/view/learning-spark/9781449359034/ch04.html, section "Example 4-27. Python custom partitioner". > On Dec 8, 2015, at 10:07 AM, Keith Freeman <8fo...@gmail.com> wrote: > > I'm not a python expert, so I'm wondering if anybody has a

Re: Intersection of two sets by key - join vs filter + join

2015-12-06 Thread Fengdong Yu
Don’t do the join first. Broadcast your small RDD: val bc = sc.broadcast(small_rdd), then large_rdd.filter(x.key in bc.value).map( x => { bc.value.other_fields + x }).distinct.groupByKey > On Dec 7, 2015, at 1:41 PM, Z Z wrote: > > I have two RDDs, one really
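
A runnable version of that idea might look like the following; the data and key names are invented for illustration, and the small RDD is assumed to fit in memory on each executor:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("broadcast-intersection"))

// Placeholder data: a large keyed RDD and a small set of keys to keep.
val largeRdd = sc.parallelize(Seq(("a", 1), ("b", 2), ("c", 3)))
val smallRdd = sc.parallelize(Seq("a", "c"))

// Broadcast the small side instead of joining, so the large side is filtered locally
// without a shuffle-heavy join.
val bc = sc.broadcast(smallRdd.collect().toSet)

val intersected = largeRdd.filter { case (key, _) => bc.value.contains(key) }
intersected.collect().foreach(println)   // prints (a,1) and (c,3)
```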

Re: Avoid Shuffling on Partitioned Data

2015-12-04 Thread Fengdong Yu
Yes, it results in a shuffle. > On Dec 4, 2015, at 6:04 PM, Stephen Boesch <java...@gmail.com> wrote: > > @Yu Fengdong: Your approach - specifically the groupBy - results in a shuffle, > does it not? > > 2015-12-04 2:02 GMT-08:00 Fengdong Yu <fengdo...@evers

Re: Avoid Shuffling on Partitioned Data

2015-12-04 Thread Fengdong Yu
There are many ways; one simple one, for example if you want to know how many rows there are for each month:
sqlContext.read.parquet("……../month=*").select($"month").groupBy($"month").count
The output looks like:
month   count
201411  100
201412  200
Hope this helps. > On Dec 4, 2015, at 5:53 PM, Yiannis
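
In code, the reply's suggestion amounts to something like this; the base path is a placeholder, and the month values and counts simply mirror the example output above:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("rows-per-month"))
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// Read the month=... partitions of a month-partitioned Parquet layout.
// Depending on the Spark version, you may need to point at the base path instead
// of a glob so that partition discovery exposes the month column.
val df = sqlContext.read.parquet("/base/path/month=*")

df.select($"month")
  .groupBy($"month")
  .count()
  .show()
// month   count
// 201411  100
// 201412  200
```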

Re: Low Latency SQL query

2015-12-01 Thread Fengdong Yu
It depends on many things: 1) What’s your data format, CSV (text) or ORC/Parquet? 2) Do you have a data warehouse to summarize/cluster your data? If your data is text or you query the raw data, it will be slow; Spark cannot do much to optimize your job. > On Dec 2, 2015, at 9:21

Re: SparkSQL API to insert DataFrame into a static partition?

2015-12-01 Thread Fengdong Yu
Hi, you can try this: if your table is under the location "/test/table/" on HDFS and has the partitions "/test/table/dt=2012" and "/test/table/dt=2013": df.write.mode(SaveMode.Append).partitionBy("date").save("/test/table") > On Dec 2, 2015, at 10:50 AM, Isabelle Phan wrote: > >
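
Slightly expanded, and assuming the DataFrame already contains the partition column (called "date" here as in the reply, while the example directories use dt=...), the write could look like:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{SaveMode, SQLContext}

val sc = new SparkContext(new SparkConf().setAppName("append-to-partition"))
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// Placeholder data; the partition column must exist in the DataFrame and match
// the on-disk partition column name.
val df = Seq(("2012", "x"), ("2013", "y")).toDF("date", "value")

df.write
  .mode(SaveMode.Append)
  .partitionBy("date")
  .save("/test/table")
```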

Re: load multiple directory using dataframe load

2015-11-23 Thread Fengdong Yu
hiveContext.read.format("orc").load("mypath/*") > On Nov 24, 2015, at 1:07 PM, Renu Yadav wrote: > > Hi , > > I am using dataframes and want to load orc files from multiple directories > like this: > hiveContext.read.format.load("mypath/3660,myPath/3661") > > but it is not
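
Two ways to express that, sketched below with the paths from the question; the glob form is the one suggested in the reply, and unionAll is a fallback when the directories do not share a common glob:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext(new SparkConf().setAppName("load-multiple-dirs"))
val hiveContext = new HiveContext(sc)

// Option 1: a glob that matches every directory at once.
val all = hiveContext.read.format("orc").load("mypath/*")

// Option 2: load each directory separately and union the results.
val df1 = hiveContext.read.format("orc").load("mypath/3660")
val df2 = hiveContext.read.format("orc").load("mypath/3661")
val both = df1.unionAll(df2)
```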

Re: Dataframe constructor

2015-11-23 Thread Fengdong Yu
It’s as simple as: val df = sqlContext.sql("select * from table") or val df = sqlContext.read.json("hdfs_path") > On Nov 24, 2015, at 3:09 AM, spark_user_2015 wrote: > > Dear all, > > is the following usage of the Dataframe constructor correct, or does it > trigger any side
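
That is, rather than calling the DataFrame constructor directly, obtain DataFrames through the SQLContext; a minimal sketch with placeholder table and path names:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("dataframe-via-sqlcontext"))
val sqlContext = new SQLContext(sc)

// DataFrames are normally built through the context, not via the constructor.
// The table below must already be registered/available to the context.
val fromSql  = sqlContext.sql("select * from some_registered_table")
val fromJson = sqlContext.read.json("hdfs:///path/to/data.json")
```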

How to adjust Spark shell table width

2015-11-21 Thread Fengdong Yu
Hi, I found that if a column value is too long, the spark shell only shows a partial result, such as: sqlContext.sql("select url from tableA").show(10) cannot show the whole URL. So how do I adjust it? Thanks
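
One possible answer, not part of the original thread: DataFrame.show takes a second truncate argument (available from Spark 1.5), so in the shell, where sqlContext is predefined, something like the following should print the full URLs:

```scala
// truncate = false disables the default 20-character column truncation.
sqlContext.sql("select url from tableA").show(10, false)
```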

Re: spark with breeze error of NoClassDefFoundError

2015-11-18 Thread Fengdong Yu
The simplest way is to remove all "provided" scopes in your pom, then run 'sbt assembly' to build your final package, and then get rid of '--jars' because the assembly already includes all dependencies. > On Nov 18, 2015, at 2:15 PM, Jack Yang wrote: > > So weird. Is there anything wrong with
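
As a rough build.sbt illustration of that advice (versions and module names are placeholders, and the sbt-assembly plugin is assumed to be configured):

```scala
// build.sbt sketch: dependencies are NOT marked "provided", so `sbt assembly`
// bundles them into the fat jar and spark-submit no longer needs --jars.
val sparkVersion = "1.5.2"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"  % sparkVersion,
  "org.apache.spark" %% "spark-mllib" % sparkVersion   // brings in breeze transitively
)
```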

Re: Spark job workflow engine recommendations

2015-11-18 Thread Fengdong Yu
Hi, we use 'Airflow' as our job workflow scheduler. > On Nov 19, 2015, at 9:47 AM, Vikram Kone wrote: > > Hi Nick, > Quick question about spark-submit command executed from azkaban with command > job type. > I see that when I press kill in azkaban portal on a

Re: Spark job workflow engine recommendations

2015-11-18 Thread Fengdong Yu
Yes, you can submit job remotely. > On Nov 19, 2015, at 10:10 AM, Vikram Kone <vikramk...@gmail.com> wrote: > > Hi Feng, > Does airflow allow remote submissions of spark jobs via spark-submit? > > On Wed, Nov 18, 2015 at 6:01 PM, Fengdong Yu <fengdo...@evers

Re: How to passing parameters to another java class

2015-11-15 Thread Fengdong Yu
Can you try: new PixelGenerator(startTime, endTime)? > On Nov 16, 2015, at 12:47 PM, Zhang, Jingyu wrote: > > I want to pass two parameters into a new java class from rdd.mapPartitions(), > the code like following. > ---Source Code > > Main method: > > /*the

Re: How to passing parameters to another java class

2015-11-15 Thread Fengdong Yu
Just make PixelGenerator a nested static class? > On Nov 16, 2015, at 1:22 PM, Zhang, Jingyu wrote: > > Fengdong

Re: How to passing parameters to another java class

2015-11-15 Thread Fengdong Yu
6 November 2015 at 16:05, Fengdong Yu <fengdo...@everstring.com > <mailto:fengdo...@everstring.com>> wrote: > Can you try : new PixelGenerator(startTime, endTime) ? > > > >> On Nov 16, 2015, at 12:47 PM, Zhang, Jingyu <jingyu.zh...@news.com.au >> <mailto:j

Re: NoSuchMethodError

2015-11-15 Thread Fengdong Yu
The code looks good. Can you check the 'import' statements in your code? Because it calls 'honeywell.test'. > On Nov 16, 2015, at 3:02 PM, Yogesh Vyas wrote: > > Hi, > > While I am trying to read a json file using SQLContext, I get the > following error: > > Exception in

Re: NoSuchMethodError

2015-11-15 Thread Fengdong Yu
Ignore my previous reply; I think HiveSpark.java is where your main method is located. Can you paste the whole pom.xml and your code? > On Nov 16, 2015, at 3:39 PM, Fengdong Yu <fengdo...@everstring.com> wrote: > > The code looks good. can you check your ‘import’ in your code? beca

Re: NoSuchMethodError

2015-11-15 Thread Fengdong Yu
And also make sure your Scala version is 2.11 for your build. > On Nov 16, 2015, at 3:43 PM, Fengdong Yu <fengdo...@everstring.com> wrote: > > Ignore my inputs, I think HiveSpark.java is your main method located. > > can you paste the whole pom.xml and your code? >

Re: NoSuchMethodError

2015-11-15 Thread Fengdong Yu
What’s your SQL? > On Nov 16, 2015, at 3:02 PM, Yogesh Vyas wrote: > > Hi, > > While I am trying to read a json file using SQLContext, I get the > following error: > > Exception in thread "main" java.lang.NoSuchMethodError: >

Re: [ANNOUNCE] Announcing Spark 1.5.2

2015-11-10 Thread Fengdong Yu
This is the simplest announcement I’ve ever seen. > On Nov 11, 2015, at 12:49 AM, Reynold Xin wrote: > > Hi All, > > Spark 1.5.2 is a maintenance release containing stability fixes. This release > is based on the branch-1.5 maintenance branch of Spark. We *strongly >

Re: parquet.io.ParquetEncodingException Warning when trying to save parquet file in Spark

2015-11-09 Thread Fengdong Yu
6.0rc7 manually ? > On Nov 9, 2015, at 9:34 PM, swetha kasireddy <swethakasire...@gmail.com> > wrote: > > I am using the following: > > > > com.twitter > parquet-avro > 1.6.0 > > > On Mon, Nov 9, 2015 at 1:00 AM, Fengdong Yu <fengdo

Re: parquet.io.ParquetEncodingException Warning when trying to save parquet file in Spark

2015-11-09 Thread Fengdong Yu
Which Spark version are you using? It was fixed in Parquet 1.7.x, so Spark 1.5.x will work. > On Nov 9, 2015, at 3:43 PM, swetha wrote: > > Hi, > > I see an unwanted Warning when I try to save a Parquet file to hdfs in Spark. > Please find below the code and the Warning

Re: There is any way to write from spark to HBase CDH4?

2015-10-27 Thread Fengdong Yu
Was this released with Spark 1.x, or is it still only in trunk? > On Oct 27, 2015, at 6:22 PM, Adrian Tanase wrote: > > Also I just remembered about cloudera’s contribution > http://blog.cloudera.com/blog/2015/08/apache-spark-comes-to-apache-hbase-with-hbase-spark-module/ >

Re: spark to hbase

2015-10-27 Thread Fengdong Yu
Also, please move the HBase-related code into a Scala object; this will resolve the serialization issue and avoid opening the connection repeatedly. And remember to close the table after the final flush. > On Oct 28, 2015, at 10:13 AM, Ted Yu wrote: > > For #2, have you checked task

Re: [Spark SQL]: Spark Job Hangs on the refresh method when saving over 1 million files

2015-10-26 Thread Fengdong Yu
How many partitions did you generate? If millions were generated, then a huge amount of memory is consumed. > On Oct 26, 2015, at 10:58 AM, Jerry Lam wrote: > > Hi guys, > > I mentioned that the partitions are generated, so I tried to read the > partition data from it. The driver

Re: Machine learning with spark (book code example error)

2015-10-14 Thread Fengdong Yu
I don’t recommend this code style; you’d better brace the function block: val testLabels = testRDD.map { case (file, text) => { val topic = file.split("/").takeRight(2).head newsgroupsMap(topic) } } > On Oct 14, 2015, at 15:46, Nick Pentreath wrote: > > Hi there.
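
Formatted across lines, and with placeholder data standing in for the book's newsgroups RDD and topic map, the braced style being recommended reads as:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("braced-style"))

// Placeholders for the book's (path, text) RDD and topic -> label map.
val testRDD = sc.parallelize(Seq(("20news-test/sci.space/12345", "some text")))
val newsgroupsMap = Map("sci.space" -> 0)

val testLabels = testRDD.map { case (file, text) => {
    val topic = file.split("/").takeRight(2).head   // second-to-last path segment
    newsgroupsMap(topic)
  }
}
```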

Re: how to use SharedSparkContext

2015-10-14 Thread Fengdong Yu
Oh, yes. Thanks much. > On Oct 14, 2015, at 18:47, Akhil Das wrote: > > com.holdenkarau.spark.testing

Re: spark sql OOM

2015-10-14 Thread Fengdong Yu
Can you search the mail archive before asking the question? At least search for how to ask a question. Nobody can give you an answer if you don’t paste your SQL or Spark SQL code. > On Oct 14, 2015, at 17:40, Andy Zhao wrote: > > Hi guys, > > I'm testing sparkSql 1.5.1,

how to use SharedSparkContext

2015-10-12 Thread Fengdong Yu
Hi, how do I add the dependency in build.sbt if I want to use SharedSparkContext? I’ve added spark-core, but it doesn’t work (cannot find SharedSparkContext).
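
Per the follow-up earlier in this digest (com.holdenkarau.spark.testing), SharedSparkContext comes from the spark-testing-base artifact rather than from spark-core. A build.sbt sketch; the version string is only a guess and should be matched to your Spark release:

```scala
// spark-testing-base provides com.holdenkarau.spark.testing.SharedSparkContext.
// The version below is illustrative; pick the one matching your Spark version.
libraryDependencies +=
  "com.holdenkarau" %% "spark-testing-base" % "1.5.2_0.3.3" % "test"
```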