A very minor typo in the Spark paper

2015-12-11 Thread Fengdong Yu
Hi, I found a very minor typo in http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf, page 4: “We complement the data mining example in Section 2.2.1 with two iterative applications: logistic regression and PageRank.” I went back to Section 2.2.1, but these two examples are not there.

Re: Differences between Spark APIs for Hadoop 1.x and Hadoop 2.x in terms of performance, progress reporting and IO metrics.

2015-12-09 Thread Fengdong Yu
I don’t think there is a performance difference between the 1.x API and the 2.x API. But it’s not a big issue for your change; only com.databricks.hadoop.mapreduce.lib.input.XmlInputFormat.java needs to change.
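For context, a minimal sketch of how the two API generations are invoked from Spark; the text input formats and path below are placeholders, not the XML format mentioned above:

    import org.apache.hadoop.io.{LongWritable, Text}

    // Old (mapred) API entry point:
    val oldApi = sc.hadoopFile[LongWritable, Text, org.apache.hadoop.mapred.TextInputFormat]("hdfs:///path/to/input")

    // New (mapreduce) API entry point -- same data, different InputFormat hierarchy:
    val newApi = sc.newAPIHadoopFile[LongWritable, Text, org.apache.hadoop.mapreduce.lib.input.TextInputFormat]("hdfs:///path/to/input")

Both return an RDD of (LongWritable, Text) pairs; the choice mainly tracks which InputFormat hierarchy your format class extends.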

I filed SPARK-12233

2015-12-08 Thread Fengdong Yu
Hi, I filed an issue, please take a look: https://issues.apache.org/jira/browse/SPARK-12233 It can definitely be reproduced.

Re: load multiple directory using dataframe load

2015-11-23 Thread Fengdong Yu
hiveContext.read.format("orc").load("mypath/*") > On Nov 24, 2015, at 1:07 PM, Renu Yadav wrote: > > Hi, > > I am using a dataframe and want to load orc files from multiple directories, > like this: > hiveContext.read.format.load("mypath/3660,myPath/3661") > > but it is not
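A minimal sketch of both options, assuming a HiveContext and example paths; on Spark 1.x, load takes a single path, so explicit directories are combined with unionAll:

    // Glob over sibling directories, as suggested above:
    val byGlob = hiveContext.read.format("orc").load("mypath/*")

    // Explicit directories on Spark 1.x: load each path and union the results.
    val byPaths = hiveContext.read.format("orc").load("mypath/3660")
      .unionAll(hiveContext.read.format("orc").load("mypath/3661"))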

Re: Seems jenkins is down (or very slow)?

2015-11-12 Thread Fengdong Yu
I can access it directly from China > On Nov 13, 2015, at 10:28 AM, Ted Yu wrote: > > I was able to access the following where the response was fast: > > https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-with-YARN >

Re: [ANNOUNCE] Announcing Spark 1.5.2

2015-11-10 Thread Fengdong Yu
This is the simplest announcement I have seen. > On Nov 11, 2015, at 12:49 AM, Reynold Xin wrote: > > Hi All, > > Spark 1.5.2 is a maintenance release containing stability fixes. This release > is based on the branch-1.5 maintenance branch of Spark. We *strongly >

Re: How to get the HDFS path for each RDD

2015-09-27 Thread Fengdong Yu
Hi Anchit, can you create more than one record in each dataset and test again? > On Sep 26, 2015, at 18:00, Fengdong Yu <fengdo...@everstring.com> wrote: > > Anchit, > > please ignore my inputs. You are right. Thanks. > > > >> On Sep 26, 2015, at 17:27, F

Re: How to get the HDFS path for each RDD

2015-09-26 Thread Fengdong Yu
> u'key1': u'value1'}, {u'key2': u'value2', 'source': > u'hdfs://localhost:9000/user/hduser/test/dt=20100102.json'}] > > Similarly you could modify the function to return 'source' and 'date' with > some string manipulation per your requirements. > > Let me know if this helps. >

Re: How to get the HDFS path for each RDD

2015-09-26 Thread Fengdong Yu
Anchit, please ignore my inputs. You are right. Thanks. > On Sep 26, 2015, at 17:27, Fengdong Yu <fengdo...@everstring.com> wrote: > > Hi Anchit, > > this is not what I expected, because you specified the HDFS directory in your > code. > I've solved it like

Re: How to get the HDFS path for each RDD

2015-09-24 Thread Fengdong Yu
> Scala > > val rdd = sparkContext.wholeTextFiles("hdfs://a-hdfs-path") > > More info: > https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.SparkContext@wholeTextFiles(String,Int):RDD[(String,String)] > > Let us kn

How to get the HDFS path for each RDD

2015-09-24 Thread Fengdong Yu
Hi, I have multiple files in JSON format, such as: /data/test1_data/sub100/test.data /data/test2_data/sub200/test.data I can sc.textFile("/data/*/*"), but I want to add {"source": "HDFS_LOCATION"} to each line, then save it all to one target HDFS location. How can I do it? Thanks.
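A hedged sketch of one way to do this, following the wholeTextFiles answer later in this thread and assuming the files are small enough to be read whole (wholeTextFiles yields (path, content) pairs); the output path and the "record" wrapper are illustrative, not from the thread:

    // Tag every line with the HDFS file it came from, then write to one target location.
    val tagged = sc.wholeTextFiles("/data/*/*")
      .flatMap { case (path, content) =>
        // Wrap each original JSON line with its source path (naive string splice for illustration).
        content.split("\n").map(line => s"""{"source": "$path", "record": $line}""")
      }
    tagged.saveAsTextFile("/data/tagged_output")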

Re: How to get the HDFS path for each RDD

2015-09-24 Thread Fengdong Yu
bring clarity to my thoughts? > > On Thu, Sep 24, 2015, 23:44 Fengdong Yu <fengdo...@everstring.com > <mailto:fengdo...@everstring.com>> wrote: > Hi Anchit, > > Thanks for the quick answer. > > My exact question is: I want to add the HDFS location into each line in

Re: Why are there no snapshots for the 1.5 branch?

2015-09-21 Thread Fengdong Yu
Do you mean you want to publish the artifact to your private repository? If so, please use ‘sbt publish’ and add the following to your build.sbt: publishTo := { val nexus = "https://YOUR_PRIVATE_REPO_HOSTS/"; if (version.value.endsWith("SNAPSHOT")) Some("snapshots" at nexus +
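The snippet above is cut off by the archive; for reference, the standard sbt pattern it appears to follow is sketched below (the snapshot and release repository paths are common Nexus defaults and an assumption here, not from the original mail):

    publishTo := {
      val nexus = "https://YOUR_PRIVATE_REPO_HOSTS/"
      if (version.value.endsWith("SNAPSHOT"))
        Some("snapshots" at nexus + "content/repositories/snapshots")    // assumed snapshot repo path
      else
        Some("releases" at nexus + "service/local/staging/deploy/maven2") // assumed release repo path
    }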

Re: spark dataframe transform JSON to ORC meets “column ambiguous exception”

2015-09-12 Thread Fengdong Yu
t; > Can you check your json input ? > > Thanks > > On Sat, Sep 12, 2015 at 2:05 AM, Fengdong Yu <fengdo...@everstring.com> > wrote: > >> Hi, >> >> I am using spark1.4.1 data frame, read JSON data, then save it to orc. >> the code is v