A very minor typo in the Spark paper

2015-12-11 Thread Fengdong Yu
Hi,

I found a very minor typo in:
http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf


Page 4:
We  complement the data mining example in Section 2.2.1 with two iterative 
applications: logistic regression and PageRank.


I went back to Section 2.2.1, but these two examples are not there. Actually, it 
should reference Sections 3.2.1 and 3.2.2.

@Matei


Thanks






Re: Differences between Spark APIs for Hadoop 1.x and Hadoop 2.x in terms of performance, progress reporting and IO metrics.

2015-12-09 Thread Fengdong Yu
I don't think there is a performance difference between the 1.x API and the 2.x API.

But it's not a big issue for your change: only 
com.databricks.hadoop.mapreduce.lib.input.XmlInputFormat.java 

needs to change, right?

Migrating to the 2.x API is not a big change. If you agree, I can do it, but I cannot promise 
to finish within one or two weeks because of my daily job.
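
For reference, a rough sketch of the reflection workaround mentioned below (untested; the helper name is mine, not from spark-xml or Spark itself). TaskAttemptContext is a class in Hadoop 1.x and an interface in Hadoop 2.x, so code that must run against both looks the concrete class up at runtime:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapreduce.{TaskAttemptContext, TaskAttemptID}

object TaskAttemptContextCompat {
  def create(conf: Configuration, id: TaskAttemptID): TaskAttemptContext = {
    val clazz =
      try {
        // Hadoop 2.x: the concrete implementation class
        Class.forName("org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl")
      } catch {
        case _: ClassNotFoundException =>
          // Hadoop 1.x: TaskAttemptContext itself is a concrete class
          Class.forName("org.apache.hadoop.mapreduce.TaskAttemptContext")
      }
    clazz.getConstructor(classOf[Configuration], classOf[TaskAttemptID])
      .newInstance(conf, id)
      .asInstanceOf[TaskAttemptContext]
  }
}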





> On Dec 9, 2015, at 5:01 PM, Hyukjin Kwon  wrote:
> 
> Hi all, 
> 
> I am writing this email to both user-group and dev-group since this is 
> applicable to both.
> 
> I am now working on Spark XML datasource 
> (https://github.com/databricks/spark-xml 
> ).
> This uses an InputFormat implementation which I downgraded to Hadoop 1.x for 
> version compatibility.
> 
> However, I found all the internal JSON datasource and others in Databricks 
> use Hadoop 2.x API dealing with TaskAttemptContextImpl by reflecting the 
> method for this because TaskAttemptContext is a class in Hadoop 1.x and an 
> interface in Hadoop 2.x.
> 
> So, I looked through the code for advantages of the Hadoop 2.x API but I 
> couldn't find any.
> I wonder if there are any advantages to using the Hadoop 2.x API.
> 
> I understand that it is still preferable to use the Hadoop 2.x APIs, at least because of 
> future differences, but somehow I feel it might not be necessary to use Hadoop 
> 2.x by reflecting a method.
> 
> I would appreciate it if you could leave a comment here 
> https://github.com/databricks/spark-xml/pull/14 
> as well as send back a 
> reply if there is a good explanation.
> 
> Thanks! 



I filed SPARK-12233

2015-12-08 Thread Fengdong Yu
Hi,

I filed an issue, please take a look:

https://issues.apache.org/jira/browse/SPARK-12233


It definitely can be reproduced.









Re: load multiple directory using dataframe load

2015-11-23 Thread Fengdong Yu
hiveContext.read.format("orc").load("bypath/*")
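
If the directories don't share a glob pattern, an alternative (a quick sketch, untested) is to load each path separately and union the DataFrames:

val df = hiveContext.read.format("orc").load("mypath/3660")
  .unionAll(hiveContext.read.format("orc").load("myPath/3661"))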



> On Nov 24, 2015, at 1:07 PM, Renu Yadav  wrote:
> 
> Hi ,
> 
> I am using DataFrames and want to load ORC files from multiple directories,
> like this:
> hiveContext.read.format.load("mypath/3660,myPath/3661")
> 
> but it is not working.
> 
> Please suggest how to achieve this
> 
> Thanks & Regards,
> Renu Yadav


-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Seems jenkins is down (or very slow)?

2015-11-12 Thread Fengdong Yu
I can access it directly from China.



> On Nov 13, 2015, at 10:28 AM, Ted Yu  wrote:
> 
> I was able to access the following where response was fast:
> 
> https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-with-YARN 
> 
> 
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/45806/ 
> 
> 
> Cheers
> 
> On Thu, Nov 12, 2015 at 6:21 PM, Yin Huai  > wrote:
> Hi Guys,
> 
> Seems Jenkins is down or very slow? Does anyone else experience it or just me?
> 
> Thanks,
> 
> Yin
> 



Re: [ANNOUNCE] Announcing Spark 1.5.2

2015-11-10 Thread Fengdong Yu
This is the simplest announcement I have seen.



> On Nov 11, 2015, at 12:49 AM, Reynold Xin  wrote:
> 
> Hi All,
> 
> Spark 1.5.2 is a maintenance release containing stability fixes. This release 
> is based on the branch-1.5 maintenance branch of Spark. We *strongly 
> recommend* all 1.5.x users to upgrade to this release.
> 
> The full list of bug fixes is here: http://s.apache.org/spark-1.5.2 
> 
> 
> http://spark.apache.org/releases/spark-release-1-5-2.html 
> 
> 
> 



Re: How to get the HDFS path for each RDD

2015-09-27 Thread Fengdong Yu
Hi Anchit, 
Can you create more than one record in each dataset and test again?



> On Sep 26, 2015, at 18:00, Fengdong Yu <fengdo...@everstring.com> wrote:
> 
> Anchit,
> 
> please ignore my inputs. you are right. Thanks.
> 
> 
> 
>> On Sep 26, 2015, at 17:27, Fengdong Yu <fengdo...@everstring.com 
>> <mailto:fengdo...@everstring.com>> wrote:
>> 
>> Hi Anchit,
>> 
>> this is not my expected, because you specified the HDFS directory in your 
>> code.
>> I've solved like this:
>> 
>> val text = sc.hadoopFile(Args.input,
>>   classOf[TextInputFormat], classOf[LongWritable], classOf[Text], 2)
>> val hadoopRdd = text.asInstanceOf[HadoopRDD[LongWritable, Text]]
>> 
>> hadoopRdd.mapPartitionsWithInputSplit((inputSplit, iterator) => {
>>   val file = inputSplit.asInstanceOf[FileSplit]
>>   iterator.map(tp => (tp._1, new Text(file.toString + "," + tp._2.toString)))
>> })
>> 
>> 
>> 
>> 
>>> On Sep 25, 2015, at 13:12, Anchit Choudhry <anchit.choud...@gmail.com 
>>> <mailto:anchit.choud...@gmail.com>> wrote:
>>> 
>>> Hi Fengdong,
>>> 
>>> So I created two files in HDFS under a test folder.
>>> 
>>> test/dt=20100101.json
>>> { "key1" : "value1" }
>>> 
>>> test/dt=20100102.json
>>> { "key2" : "value2" }
>>> 
>>> Then inside PySpark shell
>>> 
>>> rdd = sc.wholeTextFiles('./test/*')
>>> rdd.collect()
>>> [(u'hdfs://localhost:9000/user/hduser/test/dt=20100101.json', u'{ "key1" : 
>>> "value1" }), (u'hdfs://localhost:9000/user/hduser/test/dt=20100102.json', 
>>> u'{ "key2" : "value2" })]
>>> import json
>>> def editMe(y, x):
>>>   j = json.loads(y)
>>>   j['source'] = x
>>>   return j
>>> 
>>> rdd.map(lambda (x,y): editMe(y,x)).collect()
>>> [{'source': u'hdfs://localhost:9000/user/hduser/test/dt=20100101.json', 
>>> u'key1': u'value1'}, {u'key2': u'value2', 'source': 
>>> u'hdfs://localhost:9000/user/hduser/test/dt=20100102.json'}]
>>> 
>>> Similarly you could modify the function to return 'source' and 'date' with 
>>> some string manipulation per your requirements.
>>> 
>>> Let me know if this helps.
>>> 
>>> Thanks,
>>> Anchit
>>> 
>>> 
>>> On 24 September 2015 at 23:55, Fengdong Yu <fengdo...@everstring.com 
>>> <mailto:fengdo...@everstring.com>> wrote:
>>> 
>>> yes. such as I have two data sets:
>>> 
>>> date set A: /data/test1/dt=20100101
>>> data set B: /data/test2/dt=20100202
>>> 
>>> 
>>> all data has the same JSON format , such as:
>>> {“key1” : “value1”, “key2” : “value2” }
>>> 
>>> 
>>> my output expected:
>>> {“key1” : “value1”, “key2” : “value2” , “source” : “test1”, “date” : 
>>> “20100101"}
>>> {“key1” : “value1”, “key2” : “value2” , “source” : “test2”, “date” : 
>>> “20100202"}
>>> 
>>> 
>>>> On Sep 25, 2015, at 11:52, Anchit Choudhry <anchit.choud...@gmail.com 
>>>> <mailto:anchit.choud...@gmail.com>> wrote:
>>>> 
>>>> Sure. May I ask for a sample input(could be just few lines) and the output 
>>>> you are expecting to bring clarity to my thoughts?
>>>> 
>>>> On Thu, Sep 24, 2015, 23:44 Fengdong Yu <fengdo...@everstring.com 
>>>> <mailto:fengdo...@everstring.com>> wrote:
>>>> Hi Anchit, 
>>>> 
>>>> Thanks for the quick answer.
>>>> 
>>>> my exact question is : I want to add HDFS location into each line in my 
>>>> JSON  data.
>>>> 
>>>> 
>>>> 
>>>>> On Sep 25, 2015, at 11:25, Anchit Choudhry <anchit.choud...@gmail.com 
>>>>> <mailto:anchit.choud...@gmail.com>> wrote:
>>>>> 
>>>>> Hi Fengdong,
>>>>> 
>>>>> Thanks for your question.
>>>>> 
>>>>> Spark already has a function called wholeTextFiles within sparkContext 
>>>>> which can help you with that:
>>>>> 
>>>>> Python
>>>>> hdfs://a-hdfs-path/part-0
>>>>> hdfs://a-hdfs-path/part-1
>>>>

Re: How to get the HDFS path for each RDD

2015-09-26 Thread Fengdong Yu
Hi Anchit,

This is not what I expected, because you specified the HDFS directory in your code.
I've solved it like this:

val text = sc.hadoopFile(Args.input,
  classOf[TextInputFormat], classOf[LongWritable], classOf[Text], 2)
val hadoopRdd = text.asInstanceOf[HadoopRDD[LongWritable, Text]]

hadoopRdd.mapPartitionsWithInputSplit((inputSplit, iterator) => {
  val file = inputSplit.asInstanceOf[FileSplit]
  iterator.map(tp => (tp._1, new Text(file.toString + "," + tp._2.toString)))
})
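
For completeness, the same idea written out end to end (a sketch: the imports, the /data/<source>/dt=<date>/... path layout, and the output path are assumptions, and the string splice is naive; a JSON library would be safer for real data):

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.{FileSplit, TextInputFormat}
import org.apache.spark.rdd.HadoopRDD

val text = sc.hadoopFile(Args.input,
  classOf[TextInputFormat], classOf[LongWritable], classOf[Text], 2)
val hadoopRdd = text.asInstanceOf[HadoopRDD[LongWritable, Text]]

val withSource = hadoopRdd.mapPartitionsWithInputSplit((inputSplit, iterator) => {
  // e.g. hdfs://namenode:9000/data/test1/dt=20100101/part-00000
  val path = inputSplit.asInstanceOf[FileSplit].getPath.toString
  val parts = path.split("/").filter(_.nonEmpty)
  val source = parts.dropRight(2).last                      // e.g. "test1"
  val date = path.split("dt=").last.takeWhile(_.isDigit)    // e.g. "20100101"
  iterator.map { case (_, line) =>
    // splice the extra fields into each JSON line
    line.toString.trim.stripSuffix("}") + s""", "source" : "$source", "date" : "$date" }"""
  }
})
withSource.saveAsTextFile("hdfs://namenode:9000/output/with_source")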




> On Sep 25, 2015, at 13:12, Anchit Choudhry <anchit.choud...@gmail.com> wrote:
> 
> Hi Fengdong,
> 
> So I created two files in HDFS under a test folder.
> 
> test/dt=20100101.json
> { "key1" : "value1" }
> 
> test/dt=20100102.json
> { "key2" : "value2" }
> 
> Then inside PySpark shell
> 
> rdd = sc.wholeTextFiles('./test/*')
> rdd.collect()
> [(u'hdfs://localhost:9000/user/hduser/test/dt=20100101.json', u'{ "key1" : 
> "value1" }), (u'hdfs://localhost:9000/user/hduser/test/dt=20100102.json', u'{ 
> "key2" : "value2" })]
> import json
> def editMe(y, x):
>   j = json.loads(y)
>   j['source'] = x
>   return j
> 
> rdd.map(lambda (x,y): editMe(y,x)).collect()
> [{'source': u'hdfs://localhost:9000/user/hduser/test/dt=20100101.json', 
> u'key1': u'value1'}, {u'key2': u'value2', 'source': 
> u'hdfs://localhost:9000/user/hduser/test/dt=20100102.json'}]
> 
> Similarly you could modify the function to return 'source' and 'date' with 
> some string manipulation per your requirements.
> 
> Let me know if this helps.
> 
> Thanks,
> Anchit
> 
> 
> On 24 September 2015 at 23:55, Fengdong Yu <fengdo...@everstring.com 
> <mailto:fengdo...@everstring.com>> wrote:
> 
> yes. such as I have two data sets:
> 
> date set A: /data/test1/dt=20100101
> data set B: /data/test2/dt=20100202
> 
> 
> all data has the same JSON format , such as:
> {“key1” : “value1”, “key2” : “value2” }
> 
> 
> my output expected:
> {“key1” : “value1”, “key2” : “value2” , “source” : “test1”, “date” : 
> “20100101"}
> {“key1” : “value1”, “key2” : “value2” , “source” : “test2”, “date” : 
> “20100202"}
> 
> 
>> On Sep 25, 2015, at 11:52, Anchit Choudhry <anchit.choud...@gmail.com 
>> <mailto:anchit.choud...@gmail.com>> wrote:
>> 
>> Sure. May I ask for a sample input(could be just few lines) and the output 
>> you are expecting to bring clarity to my thoughts?
>> 
>> On Thu, Sep 24, 2015, 23:44 Fengdong Yu <fengdo...@everstring.com 
>> <mailto:fengdo...@everstring.com>> wrote:
>> Hi Anchit, 
>> 
>> Thanks for the quick answer.
>> 
>> my exact question is : I want to add HDFS location into each line in my JSON 
>>  data.
>> 
>> 
>> 
>>> On Sep 25, 2015, at 11:25, Anchit Choudhry <anchit.choud...@gmail.com 
>>> <mailto:anchit.choud...@gmail.com>> wrote:
>>> 
>>> Hi Fengdong,
>>> 
>>> Thanks for your question.
>>> 
>>> Spark already has a function called wholeTextFiles within sparkContext 
>>> which can help you with that:
>>> 
>>> Python
>>> hdfs://a-hdfs-path/part-0
>>> hdfs://a-hdfs-path/part-1
>>> ...
>>> hdfs://a-hdfs-path/part-n
>>> rdd = sparkContext.wholeTextFiles("hdfs://a-hdfs-path")
>>> (a-hdfs-path/part-0, its content)
>>> (a-hdfs-path/part-1, its content)
>>> ...
>>> (a-hdfs-path/part-n, its content)
>>> More info: http://spark.apache.org/docs/latest/api/python/pyspark.html?highlight=wholetext#pyspark.SparkContext.wholeTextFiles
>>> 
>>> 
>>> 
>>> Scala
>>> 
>>> val rdd = sparkContext.wholeTextFiles("hdfs://a-hdfs-path")
>>> 
>>> More info: 
>>> https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.SparkContext@wholeTextFiles(String,Int):RDD[(String,String)]
>>>  
>>> Let us know if this helps or you need more help.
>>> 
>>> Thanks,
>>> Anchit Choudhry
>>> 
>>> On 24 September 2015 at 23:12, Fengdong Yu <fengdo...@everstring.com 
>>> <mailto:fengdo...@everstring.com>> wrote:
>>> Hi,
>>> 
>>> I have  multiple files with JSON format, such as:
>>> 
>>> /data/test1_data/sub100/test.data
>>> /data/test2_data/sub200/test.data
>>> 
>>> 
>>> I can sc.textFile(“/data/*/*”)
>>> 
>>> but I want to add the {“source” : “HDFS_LOCATION”} to each line, then save 
>>> it the one target HDFS location.
>>> 
>>> how to do it, Thanks.
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> -
>>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org 
>>> <mailto:dev-unsubscr...@spark.apache.org>
>>> For additional commands, e-mail: dev-h...@spark.apache.org 
>>> <mailto:dev-h...@spark.apache.org>
>>> 
>>> 
>> 
> 
> 



Re: How to get the HDFS path for each RDD

2015-09-26 Thread Fengdong Yu
Anchit,

please ignore my inputs. you are right. Thanks.



> On Sep 26, 2015, at 17:27, Fengdong Yu <fengdo...@everstring.com> wrote:
> 
> Hi Anchit,
> 
> this is not my expected, because you specified the HDFS directory in your 
> code.
> I've solved like this:
> 
> val text = sc.hadoopFile(Args.input,
>   classOf[TextInputFormat], classOf[LongWritable], classOf[Text], 2)
> val hadoopRdd = text.asInstanceOf[HadoopRDD[LongWritable, Text]]
> 
> hadoopRdd.mapPartitionsWithInputSplit((inputSplit, iterator) => {
>   val file = inputSplit.asInstanceOf[FileSplit]
>   iterator.map(tp => (tp._1, new Text(file.toString + "," + tp._2.toString)))
> })
> 
> 
> 
> 
>> On Sep 25, 2015, at 13:12, Anchit Choudhry <anchit.choud...@gmail.com 
>> <mailto:anchit.choud...@gmail.com>> wrote:
>> 
>> Hi Fengdong,
>> 
>> So I created two files in HDFS under a test folder.
>> 
>> test/dt=20100101.json
>> { "key1" : "value1" }
>> 
>> test/dt=20100102.json
>> { "key2" : "value2" }
>> 
>> Then inside PySpark shell
>> 
>> rdd = sc.wholeTextFiles('./test/*')
>> rdd.collect()
>> [(u'hdfs://localhost:9000/user/hduser/test/dt=20100101.json', u'{ "key1" : 
>> "value1" }), (u'hdfs://localhost:9000/user/hduser/test/dt=20100102.json', 
>> u'{ "key2" : "value2" })]
>> import json
>> def editMe(y, x):
>>   j = json.loads(y)
>>   j['source'] = x
>>   return j
>> 
>> rdd.map(lambda (x,y): editMe(y,x)).collect()
>> [{'source': u'hdfs://localhost:9000/user/hduser/test/dt=20100101.json', 
>> u'key1': u'value1'}, {u'key2': u'value2', 'source': 
>> u'hdfs://localhost:9000/user/hduser/test/dt=20100102.json'}]
>> 
>> Similarly you could modify the function to return 'source' and 'date' with 
>> some string manipulation per your requirements.
>> 
>> Let me know if this helps.
>> 
>> Thanks,
>> Anchit
>> 
>> 
>> On 24 September 2015 at 23:55, Fengdong Yu <fengdo...@everstring.com 
>> <mailto:fengdo...@everstring.com>> wrote:
>> 
>> yes. such as I have two data sets:
>> 
>> date set A: /data/test1/dt=20100101
>> data set B: /data/test2/dt=20100202
>> 
>> 
>> all data has the same JSON format , such as:
>> {“key1” : “value1”, “key2” : “value2” }
>> 
>> 
>> my output expected:
>> {“key1” : “value1”, “key2” : “value2” , “source” : “test1”, “date” : 
>> “20100101"}
>> {“key1” : “value1”, “key2” : “value2” , “source” : “test2”, “date” : 
>> “20100202"}
>> 
>> 
>>> On Sep 25, 2015, at 11:52, Anchit Choudhry <anchit.choud...@gmail.com 
>>> <mailto:anchit.choud...@gmail.com>> wrote:
>>> 
>>> Sure. May I ask for a sample input(could be just few lines) and the output 
>>> you are expecting to bring clarity to my thoughts?
>>> 
>>> On Thu, Sep 24, 2015, 23:44 Fengdong Yu <fengdo...@everstring.com 
>>> <mailto:fengdo...@everstring.com>> wrote:
>>> Hi Anchit, 
>>> 
>>> Thanks for the quick answer.
>>> 
>>> my exact question is : I want to add HDFS location into each line in my 
>>> JSON  data.
>>> 
>>> 
>>> 
>>>> On Sep 25, 2015, at 11:25, Anchit Choudhry <anchit.choud...@gmail.com 
>>>> <mailto:anchit.choud...@gmail.com>> wrote:
>>>> 
>>>> Hi Fengdong,
>>>> 
>>>> Thanks for your question.
>>>> 
>>>> Spark already has a function called wholeTextFiles within sparkContext 
>>>> which can help you with that:
>>>> 
>>>> Python
>>>> hdfs://a-hdfs-path/part-0
>>>> hdfs://a-hdfs-path/part-1
>>>> ...
>>>> hdfs://a-hdfs-path/part-n
>>>> rdd = sparkContext.wholeTextFiles("hdfs://a-hdfs-path")
>>>> (a-hdfs-path/part-0, its content)
>>>> (a-hdfs-path/part-1, its content)
>>>> ...
>>>> (a-hdfs-path/part-n, its content)
>>>> More info: http://spark.apache.org/docs/latest/api/python/pyspark.html?highlight=wholetext#pyspark.SparkContext.wholeTextFiles
>>>> 
>>>> 
>>>> 
>>>> Scala
>>>> 
>>>> val rdd = sparkContext.wholeTextFiles("hdfs://a-hdfs-path")
>>>> 
>>>> More info: 
>>>> https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.SparkContext@wholeTextFiles(String,Int):RDD[(String,String)]
>>>>  
>>>> Let us know if this helps or you need more help.
>>>> 
>>>> Thanks,
>>>> Anchit Choudhry
>>>> 
>>>> On 24 September 2015 at 23:12, Fengdong Yu <fengdo...@everstring.com 
>>>> <mailto:fengdo...@everstring.com>> wrote:
>>>> Hi,
>>>> 
>>>> I have  multiple files with JSON format, such as:
>>>> 
>>>> /data/test1_data/sub100/test.data
>>>> /data/test2_data/sub200/test.data
>>>> 
>>>> 
>>>> I can sc.textFile(“/data/*/*”)
>>>> 
>>>> but I want to add the {“source” : “HDFS_LOCATION”} to each line, then save 
>>>> it the one target HDFS location.
>>>> 
>>>> how to do it, Thanks.
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> -
>>>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org 
>>>> <mailto:dev-unsubscr...@spark.apache.org>
>>>> For additional commands, e-mail: dev-h...@spark.apache.org 
>>>> <mailto:dev-h...@spark.apache.org>
>>>> 
>>>> 
>>> 
>> 
>> 
> 



Re: How to get the HDFS path for each RDD

2015-09-24 Thread Fengdong Yu
Hi Anchit, 

Thanks for the quick answer.

My exact question is: I want to add the HDFS location to each line of my JSON 
data.
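
For reference, the wholeTextFiles approach from the quoted reply below, in Scala (a rough, untested sketch; paths are placeholders):

val rdd = sc.wholeTextFiles("hdfs://namenode:9000/data/*/*")
// Each element is a (filePath, fileContent) pair, so the HDFS location travels with the data.
val linesWithSource = rdd.flatMap { case (path, content) =>
  content.split("\n").filter(_.nonEmpty).map(line => (path, line))
}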


> On Sep 25, 2015, at 11:25, Anchit Choudhry <anchit.choud...@gmail.com> wrote:
> 
> Hi Fengdong,
> 
> Thanks for your question.
> 
> Spark already has a function called wholeTextFiles within sparkContext which 
> can help you with that:
> 
> Python
> hdfs://a-hdfs-path/part-0
> hdfs://a-hdfs-path/part-1
> ...
> hdfs://a-hdfs-path/part-n
> rdd = sparkContext.wholeTextFiles("hdfs://a-hdfs-path")
> (a-hdfs-path/part-0, its content)
> (a-hdfs-path/part-1, its content)
> ...
> (a-hdfs-path/part-n, its content)
> More info: http://spark.apache.org/docs/latest/api/python/pyspark.html?highlight=wholetext#pyspark.SparkContext.wholeTextFiles
> 
> 
> 
> Scala
> 
> val rdd = sparkContext.wholeTextFiles("hdfs://a-hdfs-path")
> 
> More info: 
> https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.SparkContext@wholeTextFiles(String,Int):RDD[(String,String)]
>  
> Let us know if this helps or you need more help.
> 
> Thanks,
> Anchit Choudhry
> 
> On 24 September 2015 at 23:12, Fengdong Yu <fengdo...@everstring.com 
> <mailto:fengdo...@everstring.com>> wrote:
> Hi,
> 
> I have  multiple files with JSON format, such as:
> 
> /data/test1_data/sub100/test.data
> /data/test2_data/sub200/test.data
> 
> 
> I can sc.textFile(“/data/*/*”)
> 
> but I want to add the {“source” : “HDFS_LOCATION”} to each line, then save it 
> the one target HDFS location.
> 
> how to do it, Thanks.
> 
> 
> 
> 
> 
> 
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org 
> <mailto:dev-unsubscr...@spark.apache.org>
> For additional commands, e-mail: dev-h...@spark.apache.org 
> <mailto:dev-h...@spark.apache.org>
> 
> 



How to get the HDFS path for each RDD

2015-09-24 Thread Fengdong Yu
Hi,

I have multiple files in JSON format, such as:

/data/test1_data/sub100/test.data
/data/test2_data/sub200/test.data


I can do sc.textFile("/data/*/*"),

but I want to add {"source" : "HDFS_LOCATION"} to each line, then save it 
to one target HDFS location. 

How can I do it? Thanks.






-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: How to get the HDFS path for each RDD

2015-09-24 Thread Fengdong Yu

Yes. For example, I have two data sets:

data set A: /data/test1/dt=20100101
data set B: /data/test2/dt=20100202


All data has the same JSON format, such as:
{"key1" : "value1", "key2" : "value2" }


My expected output:
{"key1" : "value1", "key2" : "value2", "source" : "test1", "date" : "20100101"}
{"key1" : "value1", "key2" : "value2", "source" : "test2", "date" : "20100202"}


> On Sep 25, 2015, at 11:52, Anchit Choudhry <anchit.choud...@gmail.com> wrote:
> 
> Sure. May I ask for a sample input(could be just few lines) and the output 
> you are expecting to bring clarity to my thoughts?
> 
> On Thu, Sep 24, 2015, 23:44 Fengdong Yu <fengdo...@everstring.com 
> <mailto:fengdo...@everstring.com>> wrote:
> Hi Anchit, 
> 
> Thanks for the quick answer.
> 
> my exact question is : I want to add HDFS location into each line in my JSON  
> data.
> 
> 
> 
>> On Sep 25, 2015, at 11:25, Anchit Choudhry <anchit.choud...@gmail.com 
>> <mailto:anchit.choud...@gmail.com>> wrote:
>> 
>> Hi Fengdong,
>> 
>> Thanks for your question.
>> 
>> Spark already has a function called wholeTextFiles within sparkContext which 
>> can help you with that:
>> 
>> Python
>> hdfs://a-hdfs-path/part-0
>> hdfs://a-hdfs-path/part-1
>> ...
>> hdfs://a-hdfs-path/part-n
>> rdd = sparkContext.wholeTextFiles("hdfs://a-hdfs-path")
>> (a-hdfs-path/part-0, its content)
>> (a-hdfs-path/part-1, its content)
>> ...
>> (a-hdfs-path/part-n, its content)
>> More info: http://spark.apache.org/docs/latest/api/python/pyspark.html?highlight=wholetext#pyspark.SparkContext.wholeTextFiles
>> 
>> 
>> 
>> Scala
>> 
>> val rdd = sparkContext.wholeTextFiles("hdfs://a-hdfs-path")
>> 
>> More info: 
>> https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.SparkContext@wholeTextFiles(String,Int):RDD[(String,String)]
>>  
>> Let us know if this helps or you need more help.
>> 
>> Thanks,
>> Anchit Choudhry
>> 
>> On 24 September 2015 at 23:12, Fengdong Yu <fengdo...@everstring.com 
>> <mailto:fengdo...@everstring.com>> wrote:
>> Hi,
>> 
>> I have  multiple files with JSON format, such as:
>> 
>> /data/test1_data/sub100/test.data
>> /data/test2_data/sub200/test.data
>> 
>> 
>> I can sc.textFile(“/data/*/*”)
>> 
>> but I want to add the {“source” : “HDFS_LOCATION”} to each line, then save 
>> it the one target HDFS location.
>> 
>> how to do it, Thanks.
>> 
>> 
>> 
>> 
>> 
>> 
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org 
>> <mailto:dev-unsubscr...@spark.apache.org>
>> For additional commands, e-mail: dev-h...@spark.apache.org 
>> <mailto:dev-h...@spark.apache.org>
>> 
>> 
> 



Re: Why there is no snapshots for 1.5 branch?

2015-09-21 Thread Fengdong Yu
Do you mean you want to publish the artifact to your private repository?

If so, please use 'sbt publish'

and add the following to your build.sbt:

publishTo := {
  val nexus = "https://YOUR_PRIVATE_REPO_HOSTS/"
  if (version.value.endsWith("SNAPSHOT"))
    Some("snapshots" at nexus + "content/repositories/snapshots")
  else
    Some("releases" at nexus + "content/repositories/releases")
}
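
Then, in the project that consumes the snapshot, point a resolver at the same repository (a sketch; the host and version are placeholders):

resolvers += "private-snapshots" at "https://YOUR_PRIVATE_REPO_HOSTS/content/repositories/snapshots"

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.5.1-SNAPSHOT"

// Publishing (and resolving, if the repo is not anonymous) needs credentials.
credentials += Credentials(Path.userHome / ".sbt" / ".credentials")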



> On Sep 22, 2015, at 13:26, Bin Wang  wrote:
> 
> My project is using sbt (or maven), which needs to download dependencies from a 
> maven repo. I have my own private maven repo with Nexus, but I don't know how 
> to push my own build to it. Can you give me a hint?
> 
> Mark Hamstra wrote on Tuesday, September 22, 2015 at 1:25 PM:
> Yeah, whoever is maintaining the scripts and snapshot builds has fallen down 
> on the job -- but there is nothing preventing you from checking out 
> branch-1.5 and creating your own build, which is arguably a smarter thing to 
> do anyway.  If I'm going to use a non-release build, then I want the full git 
> commit history of exactly what is in that build readily available, not just 
> somewhat arbitrary JARs.
> 
> On Mon, Sep 21, 2015 at 9:57 PM, Bin Wang  > wrote:
> But I cannot find 1.5.1-SNAPSHOT either at 
> https://repository.apache.org/content/groups/snapshots/org/apache/spark/spark-core_2.10/
>  
> 
> Mark Hamstra wrote on Tuesday, September 22, 2015 at 12:55 PM:
> There is no 1.5.0-SNAPSHOT because 1.5.0 has already been released.  The 
> current head of branch-1.5 is 1.5.1-SNAPSHOT -- soon to be 1.5.1 release 
> candidates and then the 1.5.1 release.
> 
> On Mon, Sep 21, 2015 at 9:51 PM, Bin Wang  > wrote:
> I'd like to use some important bug fixes in the 1.5 branch and I looked at the 
> Apache maven host, but can't find any snapshot for the 1.5 branch. 
> https://repository.apache.org/content/groups/snapshots/org/apache/spark/spark-core_2.10/1.5.0-SNAPSHOT/
>  
> 
> 
> I can find 1.4.X and 1.6.0 versions, so why is there no snapshot for 1.5.X?
> 
> 



Re: spark dataframe transform JSON to ORC meet “column ambigous exception”

2015-09-12 Thread Fengdong Yu
Hi Ted,
I checked the JSON; there are no duplicated keys in it.
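
To double-check, a quick diagnostic in the Scala shell (a sketch, not from the job itself; 'input' is the same JSON path as in the quoted code below): print the inferred schema and flag any top-level field name that appears more than once.

val df = sqlContext.read.json(input)
df.printSchema()
// Flag any top-level field name that the inferred schema contains more than once.
df.schema.fieldNames
  .groupBy(identity)
  .collect { case (name, occurrences) if occurrences.length > 1 => name }
  .foreach(name => println(s"duplicated column: $name"))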


Azuryy Yu
Sr. Infrastructure Engineer

cel: 158-0164-9103
wechat: azuryy


On Sat, Sep 12, 2015 at 5:52 PM, Ted Yu <yuzhih...@gmail.com> wrote:

> Is it possible that Canonical_URL occurs more than once in your json ?
>
> Can you check your json input ?
>
> Thanks
>
> On Sat, Sep 12, 2015 at 2:05 AM, Fengdong Yu <fengdo...@everstring.com>
> wrote:
>
>> Hi,
>>
>> I am using the Spark 1.4.1 DataFrame API to read JSON data, then save it to ORC.
>> the code is very simple:
>>
>> DataFrame json = sqlContext.read().json(input);
>>
>> json.write().format("orc").save(output);
>>
>> the job failed. what's wrong with this exception? Thanks.
>>
>> Exception in thread "main" org.apache.spark.sql.AnalysisException:
>> Reference 'Canonical_URL' is ambiguous, could be: Canonical_URL#960,
>> Canonical_URL#1010.; at
>> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:279)
>> at
>> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveChildren(LogicalPlan.scala:116)
>> at
>> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4$$anonfun$16.apply(Analyzer.scala:350)
>> at
>> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4$$anonfun$16.apply(Analyzer.scala:350)
>> at
>> org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:48)
>> at
>> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4.applyOrElse(Analyzer.scala:350)
>> at
>> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4.applyOrElse(Analyzer.scala:341)
>> at
>> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:286)
>> at
>> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:286)
>> at
>> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
>> at
>> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:285)
>> at 
>> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$transformExpressionUp$1(QueryPlan.scala:108)
>> at
>> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2$$anonfun$apply$2.apply(QueryPlan.scala:123)
>> at
>> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>> at
>> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>> at scala.collection.immutable.List.foreach(List.scala:318) at
>> scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at
>> scala.collection.AbstractTraversable.map(Traversable.scala:105) at
>> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:122)
>> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at
>> scala.collection.Iterator$class.foreach(Iterator.scala:727) at
>> scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at
>> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) at
>> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
>> at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
>> at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
>> at scala.collection.AbstractIterator.to(Iterator.scala:1157) at
>> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
>> at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) at
>> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
>> at scala.collection.AbstractIterator.toArray(Iterator.scala:1157) at
>> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:127)
>> at
>> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8.applyOrElse(Analyzer.scala:341)
>> at
>> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8.applyOrElse(Analyzer.scala:243)
>> at
>> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:286)
>> at
>> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:286)
>> at
>> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
>> at
>> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:285)
>> at
>&g