Re: Calculate sum of values in 2nd element of tuple

2016-01-03 Thread Roberto Congiu
For the first one,

 input.map { case(x,l) => (x, l.reduce(_ + _) ) }

will do what you need.
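For example, here is a quick sketch against the RDD from your message (l.sum
below is just shorthand for l.reduce(_ + _)):

  val input = sc.parallelize(List(("abc", List(1, 2, 3, 4)), ("def", List(5, 6, 7, 8))))
  val sums  = input.map { case (k, l) => (k, l.sum) }
  sums.collect()   // Array((abc,10), (def,26))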
For the second: yes, there's a difference. One is a List, the other is a
Tuple. See for instance:
val a = (1,2,3)
a.getClass.getName
res4: String = scala.Tuple3

You should look up tuples in the Scala docs, as they are not specific to
Spark; in particular, read up on case classes and pattern matching.
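For illustration, a minimal sketch (plain Scala, no Spark needed) of how the
two shapes differ and how pattern matching destructures each:

  val withList  = ("abc", List(1, 2, 3, 4))   // (String, List[Int])
  val withTuple = ("abc", (1, 2, 3, 4))       // (String, (Int, Int, Int, Int))

  withList._2.sum                                                    // 10 -- List has sum/reduce/fold
  withTuple match { case (k, (a, b, c, d)) => (k, a + b + c + d) }   // (abc,10) -- tuples have fixed arity, so you destructure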


2016-01-03 12:00 GMT-08:00 jimitkr :

> Hi,
>
> I've created tuples of type (String, List[Int]) and want to sum the values
> in the List[Int] part, i.e. the 2nd element in each tuple.
>
> Here is my list:
> val input=sc.parallelize(List(("abc",List(1,2,3,4)),("def",List(5,6,7,8))))
>
> I want to sum up values in the 2nd element of the tuple so that the output
> is
> (abc,10)
> (def, 26)
>
> I've tried fold, reduce, foldLeft but with no success in my below code to
> calculate total:
> val valuesForDEF=input.lookup("def")
> val totalForDEF: Int = valuesForDEF.toList.reduce((x: Int,y: Int)=>x+y)
> println("THE TOTAL FOR DEF IS" + totalForDEF)
>
> How do I calculate the total?
>
> Another query. What will be the difference between the following tuples
> when
> created:
> val input=sc.parallelize(List(("abc",List(1,2,3,4)),("def",List(5,6,7,8))))
> val input=sc.parallelize(List(("abc",(1,2,3,4)),("def",(5,6,7,8))))
>
> Is there a difference in how (1,2,3,4) and List(1,2,3,4) are handled?
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Calculate-sum-of-values-in-2nd-element-of-tuple-tp25865.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
>
>


-- 
--
"Good judgment comes from experience.
Experience comes from bad judgment"
--


Re: Best practices to handle corrupted records

2015-10-15 Thread Roberto Congiu
I came to a similar solution for a similar problem. I deal with a lot of CSV
files from many different sources, and they are often malformed.
However, in my case I just have success/failure. You could make
SuccessWithWarnings a subclass of Success, or get rid of it altogether and
make the warnings optional (see the sketch below).
I was thinking of making this cleaning/conforming library open source if
you're interested.
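For illustration, a minimal sketch of the kind of hierarchy discussed here
(the names and shape are assumptions based on this thread, not an actual
library):

  // Hypothetical sketch: Success carries optional warnings, Failure carries errors.
  sealed trait Result[+A] extends Traversable[A]

  case class Success[A](value: A, warnings: Seq[String] = Nil) extends Result[A] {
    def foreach[U](f: A => U): Unit = f(value)
  }

  case class Failure(errors: Seq[String]) extends Result[Nothing] {
    def foreach[U](f: Nothing => U): Unit = ()
  }

  // Because Result is Traversable, rdd.flatMap(parse) keeps successes and drops
  // failures, while warnings stay available on each Success if you want them.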

R.

2015-10-15 5:28 GMT-07:00 Antonio Murgia :

> Hello,
> I looked around on the web and I couldn’t find any way to deal in a
> structured way with malformed/faulty records during computation. All I was
> able to find was the flatMap/Some/None technique + logging.
> I’m facing this problem because I have a processing algorithm that
> extracts more than one value from each record, but can fail in extracting
> one of those multiple values, and I want to keep track of them. Logging is
> not feasible because this “warning” happens so frequently that the logs
> would become overwhelming and impossible to read.
> Since I have 3 different possible outcomes from my processing, I modeled it
> with a class hierarchy (success, success with warnings, failure) that holds
> the result and/or warnings.
> Since Result implements Traversable, it can be used in a flatMap, discarding
> all warnings and failure results; on the other hand, if we want to keep track
> of the warnings, we can process them and output them if we need to.
>
> Kind Regards
> #A.M.
>



-- 
--
"Good judgment comes from experience.
Experience comes from bad judgment"
--


Re: Where is Redgate's HDFS explorer?

2015-08-29 Thread Roberto Congiu
If HDFS is on a Linux VM, you could also mount it with FUSE and export it
with Samba.

2015-08-29 2:26 GMT-07:00 Ted Yu yuzhih...@gmail.com:

 See
 https://hadoop.apache.org/docs/r2.7.0/hadoop-project-dist/hadoop-hdfs/HdfsNfsGateway.html

 FYI

 On Sat, Aug 29, 2015 at 1:04 AM, Akhil Das ak...@sigmoidanalytics.com
 wrote:

 You can also mount HDFS through the NFS gateway and access it, I think.

 Thanks
 Best Regards

 On Tue, Aug 25, 2015 at 3:43 AM, Dino Fancellu d...@felstar.com wrote:

 http://hortonworks.com/blog/windows-explorer-experience-hdfs/

 Seemed to exist, now no sign.

 Anything similar to tie HDFS into Windows Explorer?

 Thanks,



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Where-is-Redgate-s-HDFS-explorer-tp24431.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.







Re: Where is Redgate's HDFS explorer?

2015-08-29 Thread Roberto Congiu
It depends. If HDFS is running under Windows, FUSE won't work, but if HDFS
is on a Linux VM, box, or cluster, then you can have the Linux box/VM mount
HDFS through FUSE and at the same time export its mount point over Samba. At
that point, your Windows machine can just connect to the Samba share.
R.

2015-08-29 4:04 GMT-07:00 Dino Fancellu d...@felstar.com:

 I'm using Windows.

 Are you saying it works with Windows?

 Dino.

 On 29 August 2015 at 09:04, Akhil Das ak...@sigmoidanalytics.com wrote:
  You can also mount HDFS through the NFS gateway and access it, I think.
 
  Thanks
  Best Regards
 
  On Tue, Aug 25, 2015 at 3:43 AM, Dino Fancellu d...@felstar.com wrote:
 
  http://hortonworks.com/blog/windows-explorer-experience-hdfs/
 
  Seemed to exist, now no sign.

  Anything similar to tie HDFS into Windows Explorer?
 
  Thanks,
 
 
 
  --
  View this message in context:
 
 http://apache-spark-user-list.1001560.n3.nabble.com/Where-is-Redgate-s-HDFS-explorer-tp24431.html
  Sent from the Apache Spark User List mailing list archive at Nabble.com.
 
 
 





Re: Local Spark talking to remote HDFS?

2015-08-25 Thread Roberto Congiu
Port 8020 is not the only port you need tunnelled for HDFS to work. If you
only list the contents of a directory, port 8020 is enough... for instance,
using something like

val p = new org.apache.hadoop.fs.Path("hdfs://localhost:8020/")
val fs = p.getFileSystem(sc.hadoopConfiguration)
fs.listStatus(p)

you should see the file list.
But then, when accessing a file, the client needs to actually fetch its
blocks, so it has to connect to the DataNode.
The error 'could not obtain block' means it can't get that block from the
DataNode.
Refer to
http://docs.hortonworks.com/HDPDocuments/HDP1/HDP-1.2.1/bk_reference/content/reference_chap2_1.html
to see the complete list of ports that also need to be tunnelled.
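As an aside, here is a sketch of a common client-side workaround when the
NameNode is reachable but the DataNode advertises a private NAT address (the
hostname below is an assumption -- substitute whatever your guest resolves to
in your hosts file):

  // dfs.client.use.datanode.hostname is a standard HDFS client property; it makes
  // the client connect to DataNodes by hostname instead of the advertised IP.
  sc.hadoopConfiguration.set("dfs.client.use.datanode.hostname", "true")
  val words = sc.textFile("hdfs://sandbox.hortonworks.com:8020/tmp/people.txt")
  words.count()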



2015-08-24 13:10 GMT-07:00 Dino Fancellu d...@felstar.com:

 Changing the ip to the guest IP address just never connects.

 The VM has port tunnelling, and it passes through all the main ports,
 8020 included to the host VM.

 You can tell that it was talking to the guest VM before, simply
 because of what it said when the file was not found.

 Error is:

 Exception in thread main org.apache.spark.SparkException: Job
 aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most
 recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost):
 org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block:
 BP-452094660-10.0.2.15-1437494483194:blk_1073742905_2098
 file=/tmp/people.txt

 but I have no idea what it means by that. It certainly can find the
 file and knows it exists.



 On 24 August 2015 at 20:43, Roberto Congiu roberto.con...@gmail.com
 wrote:
  When you launch your HDP guest VM, most likely it gets launched with NAT
 and
  an address on a private network (192.168.x.x) so on your windows host you
  should use that address (you can find out using ifconfig on the guest
 OS).
  I usually add an entry to my /etc/hosts for VMs that I use often... if you
  use vagrant, there's also a vagrant module that can do that automatically.
  Also, I am not sure how the default HDP VM is set up, that is, if it only
  binds HDFS to 127.0.0.1 or to all addresses. You can check that with
  netstat -a.
 
  R.
 
  2015-08-24 11:46 GMT-07:00 Dino Fancellu d...@felstar.com:
 
  I have a file in HDFS inside my HortonWorks HDP 2.3_1 VirtualBox VM.
 
  If I go into the guest spark-shell and refer to the file thus, it works
  fine
 
   val words=sc.textFile("hdfs:///tmp/people.txt")
   words.count
 
  However if I try to access it from a local Spark app on my Windows host,
  it
  doesn't work
 
  val conf = new SparkConf().setMaster("local").setAppName("My App")
  val sc = new SparkContext(conf)

  val words=sc.textFile("hdfs://localhost:8020/tmp/people.txt")
  words.count
 
  Emits
 
 
 
  The port 8020 is open, and if I choose the wrong file name, it will tell
  me
 
 
 
  My pom has
 
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.11</artifactId>
    <version>1.4.1</version>
    <scope>provided</scope>
  </dependency>
 
  Am I doing something wrong?
 
  Thanks.
 
 
 
 
  --
  View this message in context:
 
 http://apache-spark-user-list.1001560.n3.nabble.com/Local-Spark-talking-to-remote-HDFS-tp24425.html
  Sent from the Apache Spark User List mailing list archive at Nabble.com.
 
 
 



Re: Local Spark talking to remote HDFS?

2015-08-25 Thread Roberto Congiu
That's what I'd suggest too. Furthermore, if you use vagrant to spin up
VMs, there's a module that can do that automatically for you.

R.

2015-08-25 10:11 GMT-07:00 Steve Loughran ste...@hortonworks.com:

 I wouldn't try to play with forwarding and tunnelling; always hard to work
 out what ports get used everywhere, and the services like hostname==URL in
 paths.

 Can't you just set up an entry in the windows /etc/hosts file? It's what I
 do (on Unix) to talk to VMs


  On 25 Aug 2015, at 04:49, Dino Fancellu d...@felstar.com wrote:
 
  Tried adding 50010, 50020 and 50090. Still no difference.
 
  I can't imagine I'm the only person on the planet wanting to do this.
 
  Anyway, thanks for trying to help.
 
  Dino.
 
  On 25 August 2015 at 08:22, Roberto Congiu roberto.con...@gmail.com
 wrote:
  Port 8020 is not the only port you need tunnelled for HDFS to work. If
 you
  only list the contents of a directory, port 8020 is enough... for
 instance,
  using something
 
  val p = new org.apache.hadoop.fs.Path("hdfs://localhost:8020/")
  val fs = p.getFileSystem(sc.hadoopConfiguration)
  fs.listStatus(p)
 
  you should see the file list.
  But then, when accessing a file, you need to actually get its blocks,
 it has
  to connect to the data node.
  The error 'could not obtain block' means it can't get that block from
 the
  DataNode.
  Refer to
 
 http://docs.hortonworks.com/HDPDocuments/HDP1/HDP-1.2.1/bk_reference/content/reference_chap2_1.html
  to see the complete list of ports that also need to be tunnelled.
 
 
 
  2015-08-24 13:10 GMT-07:00 Dino Fancellu d...@felstar.com:
 
  Changing the ip to the guest IP address just never connects.
 
  The VM has port tunnelling, and it passes through all the main ports,
  8020 included to the host VM.
 
  You can tell that it was talking to the guest VM before, simply
  because of what it said when the file was not found.
 
  Error is:
 
  Exception in thread main org.apache.spark.SparkException: Job
  aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most
  recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost):
  org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block:
  BP-452094660-10.0.2.15-1437494483194:blk_1073742905_2098
  file=/tmp/people.txt
 
  but I have no idea what it means by that. It certainly can find the
  file and knows it exists.
 
 
 
  On 24 August 2015 at 20:43, Roberto Congiu roberto.con...@gmail.com
  wrote:
  When you launch your HDP guest VM, most likely it gets launched with
 NAT
  and
  an address on a private network (192.168.x.x) so on your windows host
  you
  should use that address (you can find out using ifconfig on the guest
  OS).
  I usually add an entry to my /etc/hosts for VMs that I use often... if you
  use vagrant, there's also a vagrant module that can do that automatically.
  Also, I am not sure how the default HDP VM is set up, that is, if it
  only
  binds HDFS to 127.0.0.1 or to all addresses. You can check that with
  netstat
  -a.
 
  R.
 
  2015-08-24 11:46 GMT-07:00 Dino Fancellu d...@felstar.com:
 
  I have a file in HDFS inside my HortonWorks HDP 2.3_1 VirtualBox VM.
 
  If I go into the guest spark-shell and refer to the file thus, it
 works
  fine
 
   val words=sc.textFile("hdfs:///tmp/people.txt")
   words.count
 
  However if I try to access it from a local Spark app on my Windows
  host,
  it
  doesn't work
 
   val conf = new SparkConf().setMaster("local").setAppName("My App")
   val sc = new SparkContext(conf)

   val words=sc.textFile("hdfs://localhost:8020/tmp/people.txt")
   words.count
 
  Emits
 
 
 
  The port 8020 is open, and if I choose the wrong file name, it will
  tell
  me
 
 
 
  My pom has
 
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.11</artifactId>
    <version>1.4.1</version>
    <scope>provided</scope>
  </dependency>
 
  Am I doing something wrong?
 
  Thanks.
 
 
 
 
  --
  View this message in context:
 
 
 http://apache-spark-user-list.1001560.n3.nabble.com/Local-Spark-talking-to-remote-HDFS-tp24425.html
  Sent from the Apache Spark User List mailing list archive at
  Nabble.com.
 
 
 
 
 
 
 
 




Re: Local Spark talking to remote HDFS?

2015-08-24 Thread Roberto Congiu
When you launch your HDP guest VM, most likely it gets launched with NAT
and an address on a private network (192.168.x.x), so on your Windows host
you should use that address (you can find it out by running ifconfig on the
guest OS).
I usually add an entry to my /etc/hosts for VMs that I use often... if you
use vagrant, there's also a vagrant module that can do that automatically.
Also, I am not sure how the default HDP VM is set up, that is, if it only
binds HDFS to 127.0.0.1 or to all addresses. You can check that with
netstat -a.
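For example, a minimal sketch of checking connectivity from the host once you
know the guest's address (192.168.56.101 below is just an assumed example from
ifconfig -- substitute your own):

  val p  = new org.apache.hadoop.fs.Path("hdfs://192.168.56.101:8020/")
  val fs = p.getFileSystem(sc.hadoopConfiguration)
  fs.listStatus(p).foreach(s => println(s.getPath))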

R.

2015-08-24 11:46 GMT-07:00 Dino Fancellu d...@felstar.com:

 I have a file in HDFS inside my HortonWorks HDP 2.3_1 VirtualBox VM.

 If I go into the guest spark-shell and refer to the file thus, it works
 fine

   val words=sc.textFile("hdfs:///tmp/people.txt")
   words.count

 However if I try to access it from a local Spark app on my Windows host, it
 doesn't work

   val conf = new SparkConf().setMaster("local").setAppName("My App")
   val sc = new SparkContext(conf)

   val words=sc.textFile("hdfs://localhost:8020/tmp/people.txt")
   words.count

 Emits



 The port 8020 is open, and if I choose the wrong file name, it will tell me



 My pom has

 <dependency>
   <groupId>org.apache.spark</groupId>
   <artifactId>spark-core_2.11</artifactId>
   <version>1.4.1</version>
   <scope>provided</scope>
 </dependency>

 Am I doing something wrong?

 Thanks.




 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Local-Spark-talking-to-remote-HDFS-tp24425.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.





Re: SPARK sql: Need JSON back instead of row

2015-08-21 Thread Roberto Congiu
2015-08-21 3:17 GMT-07:00 smagadi sudhindramag...@fico.com:

 teenagers.toJSON gives the JSON but it does not preserve the parent ids,

 meaning if the input was {"name":"Yin",
 "address":{"city":"Columbus","state":"Ohio"},"age":20}

 val x = sqlContext.sql("SELECT name, address.city, address.state, age FROM people WHERE age > 19 AND age <= 30").toJSON

  x.collect().foreach(println)

 This returns back, missing the address:
 {"name":"Yin","city":"Columbus","state":"Ohio","age":20}
 Is this a bug?


You're not selecting the nested address column itself (only its leaf fields),
so of course the nesting is not there in the output.
Try


sqlContext.sql("SELECT * FROM ppl").toJSON.collect().foreach(println)

instead.
I get

{"address":{"city":"Columbus","state":"Ohio"},"name":"Yin"}
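To keep the nested structure while still filtering and projecting, a sketch
(table and column names taken from the example above) is to select the struct
column itself rather than its leaves:

  val x = sqlContext.sql(
    "SELECT name, address, age FROM people WHERE age > 19 AND age <= 30").toJSON
  x.collect().foreach(println)
  // {"name":"Yin","address":{"city":"Columbus","state":"Ohio"},"age":20}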

R.


Re: Nested DataFrame(SchemaRDD)

2015-06-23 Thread Roberto Congiu
I wrote a brief how-to on building nested records in Spark and storing them
in Parquet here:
http://www.congiu.com/creating-nested-data-parquet-in-spark-sql/
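In short, here is a minimal sketch of the approach (Spark 1.3+ API; the paths,
field names, and schema are just examples, not anything prescribed):

  import org.apache.spark.sql.Row
  import org.apache.spark.sql.types._

  // A nested column: each id maps to an array of (key, value) structs.
  val innerSchema = StructType(Seq(
    StructField("key", StringType),
    StructField("value", IntegerType)))
  val schema = StructType(Seq(
    StructField("id", StringType),
    StructField("attributes", ArrayType(innerSchema))))

  val rows = sc.parallelize(Seq(
    Row("a", Seq(Row("x", 1), Row("y", 2))),
    Row("b", Seq(Row("z", 3)))))

  val df = sqlContext.createDataFrame(rows, schema)
  // Stored as a nested Parquet schema (Spark 1.4+; use saveAsParquetFile on 1.3).
  df.write.parquet("/tmp/nested.parquet")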

2015-06-23 16:12 GMT-07:00 Richard Catlin richard.m.cat...@gmail.com:

 How do I create a DataFrame(SchemaRDD) with a nested array of Rows in a
 column?  Is there an example?  Will this store as a nested parquet file?

 Thanks.

 Richard Catlin