Re: Building Spark packages with SBT or Maven

2016-03-15 Thread Chandeep Singh
Yes, sbt uses the same structure as maven for source files.

> On Mar 15, 2016, at 1:53 PM, Mich Talebzadeh  
> wrote:
> 
> Thanks. The Maven structure is identical to sbt's; I will just have to 
> replace the sbt build file with pom.xml.
> 
> I will use your pom.xml to start with.
> 
> Cheers
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>  
> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
>  
> 
> On 15 March 2016 at 13:12, Chandeep Singh  <mailto:c...@chandeep.com>> wrote:
> You can build using maven from the command line as well.
> 
> This layout should give you an idea and here are some resources - 
> http://www.scala-lang.org/old/node/345
> 
> project/
>    pom.xml    - Defines the project
>    src/
>       main/
>          java/       - Contains all Java code that will go in your final artifact
>                        (see http://maven.apache.org/plugins/maven-compiler-plugin/ for details)
>          scala/      - Contains all Scala code that will go in your final artifact
>                        (see http://scala-tools.org/mvnsites/maven-scala-plugin/ for details)
>          resources/  - Contains all static files that should be available on the classpath
>                        in the final artifact (see http://maven.apache.org/plugins/maven-resources-plugin/ for details)
>          webapp/     - Contains all content for a web application (jsps, css, images, etc.)
>                        (see http://maven.apache.org/plugins/maven-war-plugin/ for details)
>       site/          - Contains all apt or xdoc files used to create a project website
>                        (see http://maven.apache.org/plugins/maven-site-plugin/ for details)
>       test/
>          java/       - Contains all Java code used for testing
>                        (see http://maven.apache.org/plugins/maven-compiler-plugin/ for details)
>          scala/      - Contains all Scala code used for testing
>                        (see http://scala-tools.org/mvnsites/maven-scala-plugin/ for details)
>          resources/  - Contains all static content that should be available on the classpath
>                        during testing (see http://maven.apache.org/plugins/maven-resources-plugin/ for details)
> 
> 
>> On Mar 15, 2016, at 12:38 PM, Chandeep Singh > <mailto:c...@chandeep.com>> wrote:
>> 
>> Do you have the Eclipse Maven plugin setup? http://www.eclipse.org/m2e/ 
>> <http://www.eclipse.org/m2e/>
>> 
>> Once you have it setup, File -> New -> Other -> MavenProject -> Next / 
>> Finish. You’ll see a default POM.xml which you can modify / replace. 
>> 
>> 
>> 
>> Here is some documentation that should help: 
>> http://scala-ide.org/docs/tutorials/m2eclipse/ 
>> <http://scala-ide.org/docs/tutorials/m2eclipse/>
>> 
>> I’m using the same Eclipse build as you on my Mac. I mostly build a shaded 
>> JAR and SCP it to the cluster.
>> 
>>> On Mar 15, 2016, at 12:22 PM, Mich Talebzadeh >> <mailto:mich.talebza...@gmail.com>> wrote:
>>> 
>>> Great Chandeep. I also have Eclipse Scala IDE below
>>> 
>>> scala IDE build of Eclipse SDK
>>> Build id: 4.3.0-vfinal-2015-12-01T15:55:22Z-Typesafe
>>> 
>>> I am no expert on Eclipse, so if I create a project called ImportCSV, where do 
>>> I need to put the pom file, or how do I reference it, please? My Eclipse runs 
>>> on a Linux host, so it can access all the directories that the sbt project 
>>> accesses? I also believe there will not be any need for external jar files 
>>> in the build path?
>>> 
>>> Thanks
>>> 
>>> Dr Mich Talebzadeh
>>>  
>>> LinkedIn  
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>  
>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>>>  
>>> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
>>>  
>>> 
>>> On 15 March 2016 at 12:15, Chandeep Singh >> <mailto:c...@chandeep.com>> wrote:
>>> Btw, just to add to the confusion ;) I use Maven as well since I moved from 
>>> Java to Scala, but everyone I talk to has been recommending SBT for Scala.

Re: Building Spark packages with SBT or Maven

2016-03-15 Thread Chandeep Singh
You can build using maven from the command line as well.

This layout should give you an idea and here are some resources - 
http://www.scala-lang.org/old/node/345

project/
   pom.xml    - Defines the project
   src/
      main/
         java/       - Contains all Java code that will go in your final artifact
                       (see http://maven.apache.org/plugins/maven-compiler-plugin/ for details)
         scala/      - Contains all Scala code that will go in your final artifact
                       (see http://scala-tools.org/mvnsites/maven-scala-plugin/ for details)
         resources/  - Contains all static files that should be available on the classpath
                       in the final artifact (see http://maven.apache.org/plugins/maven-resources-plugin/ for details)
         webapp/     - Contains all content for a web application (jsps, css, images, etc.)
                       (see http://maven.apache.org/plugins/maven-war-plugin/ for details)
      site/          - Contains all apt or xdoc files used to create a project website
                       (see http://maven.apache.org/plugins/maven-site-plugin/ for details)
      test/
         java/       - Contains all Java code used for testing
                       (see http://maven.apache.org/plugins/maven-compiler-plugin/ for details)
         scala/      - Contains all Scala code used for testing
                       (see http://scala-tools.org/mvnsites/maven-scala-plugin/ for details)
         resources/  - Contains all static content that should be available on the classpath
                       during testing (see http://maven.apache.org/plugins/maven-resources-plugin/ for details)


> On Mar 15, 2016, at 12:38 PM, Chandeep Singh  wrote:
> 
> Do you have the Eclipse Maven plugin setup? http://www.eclipse.org/m2e/ 
> <http://www.eclipse.org/m2e/>
> 
> Once you have it setup, File -> New -> Other -> MavenProject -> Next / 
> Finish. You’ll see a default POM.xml which you can modify / replace. 
> 
> 
> 
> Here is some documentation that should help: 
> http://scala-ide.org/docs/tutorials/m2eclipse/ 
> <http://scala-ide.org/docs/tutorials/m2eclipse/>
> 
> I’m using the same Eclipse build as you on my Mac. I mostly build a shaded 
> JAR and SCP it to the cluster.
> 
>> On Mar 15, 2016, at 12:22 PM, Mich Talebzadeh > <mailto:mich.talebza...@gmail.com>> wrote:
>> 
>> Great Chandeep. I also have Eclipse Scala IDE below
>> 
>> scala IDE build of Eclipse SDK
>> Build id: 4.3.0-vfinal-2015-12-01T15:55:22Z-Typesafe
>> 
>> I am no expert on Eclipse, so if I create a project called ImportCSV, where do I 
>> need to put the pom file, or how do I reference it, please? My Eclipse runs on 
>> a Linux host, so it can access all the directories that the sbt project accesses? 
>> I also believe there will not be any need for external jar files in the build 
>> path?
>> 
>> Thanks
>> 
>> Dr Mich Talebzadeh
>>  
>> LinkedIn  
>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>  
>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>>  
>> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
>>  
>> 
>> On 15 March 2016 at 12:15, Chandeep Singh > <mailto:c...@chandeep.com>> wrote:
>> Btw, just to add to the confusion ;) I use Maven as well since I moved from 
>> Java to Scala but everyone I talk to has been recommending SBT for Scala. 
>> 
>> I use the Eclipse Scala IDE to build. http://scala-ide.org/ 
>> <http://scala-ide.org/>
>> 
>> Here is my sample POM. You can add dependencies based on your requirement.
>> 
>> <project xmlns="http://maven.apache.org/POM/4.0.0"
>>  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
>>  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
>>  <modelVersion>4.0.0</modelVersion>
>>  <artifactId>spark</artifactId>
>>  <version>1.0</version>
>>  <name>${project.artifactId}</name>
>> 
>>  <properties>
>>  <maven.compiler.source>1.7</maven.compiler.source>
>>  <maven.compiler.target>1.7</maven.compiler.target>
>>  <encoding>UTF-8</encoding>
>>  <scala.version>2.10.4</scala.version>
>>  <maven-scala-plugin.version>2.15.2</maven-scala-plugin.version>
>>  </properties>
>> 
>>  <repositories>
>>  <repository>
>>  <id>clou

Re: Building Spark packages with SBT or Maven

2016-03-15 Thread Chandeep Singh
Do you have the Eclipse Maven plugin setup? http://www.eclipse.org/m2e/ 
<http://www.eclipse.org/m2e/>

Once you have it setup, File -> New -> Other -> MavenProject -> Next / Finish. 
You’ll see a default POM.xml which you can modify / replace. 



Here is some documentation that should help: 
http://scala-ide.org/docs/tutorials/m2eclipse/ 
<http://scala-ide.org/docs/tutorials/m2eclipse/>

I’m using the same Eclipse build as you on my Mac. I mostly build a shaded JAR 
and SCP it to the cluster.

> On Mar 15, 2016, at 12:22 PM, Mich Talebzadeh  
> wrote:
> 
> Great Chandeep. I also have Eclipse Scala IDE below
> 
> scala IDE build of Eclipse SDK
> Build id: 4.3.0-vfinal-2015-12-01T15:55:22Z-Typesafe
> 
> I am no expert on Eclipse, so if I create a project called ImportCSV, where do I 
> need to put the pom file, or how do I reference it, please? My Eclipse runs on 
> a Linux host, so it can access all the directories that the sbt project accesses? 
> I also believe there will not be any need for external jar files in the build 
> path?
> 
> Thanks
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>  
> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
>  
> 
> On 15 March 2016 at 12:15, Chandeep Singh  <mailto:c...@chandeep.com>> wrote:
> Btw, just to add to the confusion ;) I use Maven as well since I moved from 
> Java to Scala but everyone I talk to has been recommending SBT for Scala. 
> 
> I use the Eclipse Scala IDE to build. http://scala-ide.org/ 
> <http://scala-ide.org/>
> 
> Here is my sample POM. You can add dependencies based on your requirement.
> 
> <project xmlns="http://maven.apache.org/POM/4.0.0"
>   xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
>   xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
>   <modelVersion>4.0.0</modelVersion>
>   <artifactId>spark</artifactId>
>   <version>1.0</version>
>   <name>${project.artifactId}</name>
> 
>   <properties>
>     <maven.compiler.source>1.7</maven.compiler.source>
>     <maven.compiler.target>1.7</maven.compiler.target>
>     <encoding>UTF-8</encoding>
>     <scala.version>2.10.4</scala.version>
>     <maven-scala-plugin.version>2.15.2</maven-scala-plugin.version>
>   </properties>
> 
>   <repositories>
>     <repository>
>       <id>cloudera-repo-releases</id>
>       <url>https://repository.cloudera.com/artifactory/repo/</url>
>     </repository>
>   </repositories>
> 
>   <dependencies>
>     <dependency>
>       <groupId>org.scala-lang</groupId>
>       <artifactId>scala-library</artifactId>
>       <version>${scala.version}</version>
>     </dependency>
>     <dependency>
>       <groupId>org.apache.spark</groupId>
>       <artifactId>spark-core_2.10</artifactId>
>       <version>1.5.0-cdh5.5.1</version>
>     </dependency>
>     <dependency>
>       <groupId>org.apache.spark</groupId>
>       <artifactId>spark-mllib_2.10</artifactId>
>       <version>1.5.0-cdh5.5.1</version>
>     </dependency>
>     <dependency>
>       <groupId>org.apache.spark</groupId>
>       <artifactId>spark-hive_2.10</artifactId>
>       <version>1.5.0</version>
>     </dependency>
>   </dependencies>
> 
>   <build>
>     <sourceDirectory>src/main/scala</sourceDirectory>
>     <testSourceDirectory>src/test/scala</testSourceDirectory>
>     <plugins>
>       <plugin>
>         <groupId>org.scala-tools</groupId>
>         <artifactId>maven-scala-plugin</artifactId>
>         <version>${maven-scala-plugin.version}</version>
>         <executions>
>           <execution>
>             <goals>
>               <goal>compile</goal>
>               <goal>testCompile</goal>
>             </goals>
>           </execution>
>         </executions>
>         <configuration>
>           <jvmArgs>
>             <jvmArg>-Xms64m</jvmArg>
>             <jvmArg>-Xmx1024m</jvmArg>

Re: Building Spark packages with SBT or Maven

2016-03-15 Thread Chandeep Singh
> #
> # Procedure:generic.ksh
> #
> # Description:  Compiles and runs a Scala app using sbt and spark-submit
> #
> # Parameters:   none
> #
> #
> # Vers|  Date  | Who | DA | Description
> #-++-++-
> # 1.0 |04/03/15|  MT || Initial Version
> #
> #
> function F_USAGE
> {
>echo "USAGE: ${1##*/} -A ''"
>echo "USAGE: ${1##*/} -H '' -h ''"
>exit 10
> }
> #
> # Main Section
> #
> if [[ "${1}" = "-h" || "${1}" = "-H" ]]; then
>F_USAGE $0
> fi
> ## MAP INPUT TO VARIABLES
> while getopts A: opt
> do
>case $opt in
>(A) APPLICATION="$OPTARG" ;;
>(*) F_USAGE $0 ;;
>esac
> done
> [[ -z ${APPLICATION} ]] && print "You must specify an application value " && 
> F_USAGE $0
> ENVFILE=/home/hduser/dba/bin/environment.ksh
> if [[ -f $ENVFILE ]]
> then
> . $ENVFILE
> . ~/spark_1.5.2_bin-hadoop2.6.kshrc
> else
> echo "Abort: $0 failed. No environment file ( $ENVFILE ) found"
> exit 1
> fi
> ##FILE_NAME=`basename $0 .ksh`
> FILE_NAME=${APPLICATION}
> CLASS=`echo ${FILE_NAME}|tr "[:upper:]" "[:lower:]"`
> NOW="`date +%Y%m%d_%H%M`"
> LOG_FILE=${LOGDIR}/${FILE_NAME}.log
> [ -f ${LOG_FILE} ] && rm -f ${LOG_FILE}
> print "\n" `date` ", Started $0" | tee -a ${LOG_FILE}
> cd ../${FILE_NAME}
> print "Compiling ${FILE_NAME}" | tee -a ${LOG_FILE}
> sbt package
> print "Submiiting the job" | tee -a ${LOG_FILE}
> 
> ${SPARK_HOME}/bin/spark-submit \
> --packages com.databricks:spark-csv_2.11:1.3.0 \
> --class "${FILE_NAME}" \
> --master spark://50.140.197.217:7077 
> <http://50.140.197.217:7077/> \
> --executor-memory=12G \
> --executor-cores=12 \
> --num-executors=2 \
> target/scala-2.10/${CLASS}_2.10-1.0.jar
> print `date` ", Finished $0" | tee -a ${LOG_FILE}
> exit
> 
> 
> So to run it for ImportCSV all I need is to do
> 
> ./generic.ksh -A ImportCSV
> 
> Now can anyone kindly give me a rough guideline on the directory layout and 
> location of pom.xml to make this work using Maven?
> 
> Thanks
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>  
> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
>  
> 
> On 15 March 2016 at 10:50, Sean Owen  <mailto:so...@cloudera.com>> wrote:
> FWIW, I strongly prefer Maven over SBT even for Scala projects. The
> Spark build of reference is Maven.
> 
> On Tue, Mar 15, 2016 at 10:45 AM, Chandeep Singh  <mailto:c...@chandeep.com>> wrote:
> > For Scala, SBT is recommended.
> >
> > On Mar 15, 2016, at 10:42 AM, Mich Talebzadeh  > <mailto:mich.talebza...@gmail.com>>
> > wrote:
> >
> > Hi,
> >
> > I build my Spark/Scala packages using SBT that works fine. I have created
> > generic shell scripts to build and submit it.
> >
> > Yesterday I noticed that some use Maven and Pom for this purpose.
> >
> > Which approach is recommended?
> >
> > Thanks,
> >
> >
> > Dr Mich Talebzadeh
> >
> >
> >
> > LinkedIn
> > https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> >  
> > <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
> >
> >
> >
> > http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
> >
> >
> >
> >
> 



Re: Building Spark packages with SBT or Maven

2016-03-15 Thread Chandeep Singh
Pulled this from Stack Overflow:

We're using Maven to build Scala projects at work because it integrates well 
with our CI server. We could just run a shell script to kick off a build, of 
course, but we've got a bunch of other information coming out of Maven that we 
want to go into CI. That's about the only reason I can think of to use Maven 
for a Scala project.

Otherwise, just use SBT. You get access to the same dependencies (really the 
best part about Maven, IMHO). You also get incremental compilation, which 
is huge, and the ability to start up a shell inside your project, which is also 
great.

ScalaMock only works with SBT, and you're probably going to want to use that 
rather than a Java mocking library. On top of that, it's much easier to extend 
SBT since you can write full scala code in the build file, so you don't have to 
go through all the rigamarole of writing a Mojo.

In short, just use SBT unless you really need tight integration into your CI 
server. 

http://stackoverflow.com/questions/11277967/pros-and-cons-of-using-sbt-vs-maven-in-scala-project
 
<http://stackoverflow.com/questions/11277967/pros-and-cons-of-using-sbt-vs-maven-in-scala-project>
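
To illustrate the point above about the build file being plain Scala, here is a minimal build.sbt sketch; the project name, Scala version, and Spark version simply mirror the ones used elsewhere in this thread, and the environment-variable lookup is only an illustrative assumption:

name := "Sequence"

version := "1.0"

scalaVersion := "2.10.5"

// ordinary Scala in the build definition: pick the Spark version from the
// environment if set, otherwise fall back to a default
val sparkVersion = sys.env.getOrElse("SPARK_VERSION", "1.5.2")

libraryDependencies += "org.apache.spark" %% "spark-core" % sparkVersion % "provided"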


> On Mar 15, 2016, at 10:45 AM, Chandeep Singh  wrote:
> 
> For Scala, SBT is recommended.
> 
>> On Mar 15, 2016, at 10:42 AM, Mich Talebzadeh > <mailto:mich.talebza...@gmail.com>> wrote:
>> 
>> Hi,
>> 
>> I build my Spark/Scala packages using SBT that works fine. I have created 
>> generic shell scripts to build and submit it.
>> 
>> Yesterday I noticed that some use Maven and Pom for this purpose.
>> 
>> Which approach is recommended?
>> 
>> Thanks,
>> 
>> 
>> Dr Mich Talebzadeh
>>  
>> LinkedIn  
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>  
>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>>  
>> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
>>  
> 



Re: Building Spark packages with SBT or Maven

2016-03-15 Thread Chandeep Singh
For Scala, SBT is recommended.

> On Mar 15, 2016, at 10:42 AM, Mich Talebzadeh  
> wrote:
> 
> Hi,
> 
> I build my Spark/Scala packages using SBT, which works fine. I have created 
> generic shell scripts to build and submit them.
> 
> Yesterday I noticed that some use Maven and Pom for this purpose.
> 
> Which approach is recommended?
> 
> Thanks,
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> 
>  
> http://talebzadehmich.wordpress.com 
>  



Re: Steps to Run Spark Scala job from Oozie on EC2 Hadoop cluster

2016-03-07 Thread Chandeep Singh
As a workaround you could put your spark-submit statement in a shell script 
and then use Oozie’s SSH action to execute that script.

> On Mar 7, 2016, at 3:58 PM, Neelesh Salian  wrote:
> 
> Hi Divya,
> 
> This link should have the details that you need to begin using the Spark 
> Action on Oozie:
> https://oozie.apache.org/docs/4.2.0/DG_SparkActionExtension.html 
> 
> 
> Thanks.
> 
> On Mon, Mar 7, 2016 at 7:52 AM, Benjamin Kim  > wrote:
> To comment…
> 
> At my company, we have not gotten it to work in any other mode than local. If 
> we try any of the yarn modes, it fails with a “file does not exist” error 
> when trying to locate the executable jar. I mentioned this to the Hue users 
> group, which we used for this, and they replied that the Spark Action is very 
> basic implementation and that they will be writing their own for production 
> use.
> 
> That’s all I know...
> 
>> On Mar 7, 2016, at 1:18 AM, Deepak Sharma > > wrote:
>> 
>> There is Spark action defined for oozie workflows.
>> Though I am not sure if it supports only Java SPARK jobs or Scala jobs as 
>> well.
>> https://oozie.apache.org/docs/4.2.0/DG_SparkActionExtension.html 
>> 
>> Thanks
>> Deepak
>> 
>> On Mon, Mar 7, 2016 at 2:44 PM, Divya Gehlot > > wrote:
>> Hi,
>> 
>> Could somebody help me by providing the steps / redirecting me to a blog or 
>> documentation on how to run a Spark job written in Scala through Oozie.
>> 
>> Would really appreciate the help.
>> 
>> 
>> 
>> Thanks,
>> Divya 
>> 
>> 
>> 
>> -- 
>> Thanks
>> Deepak
>> www.bigdatabig.com 
>> www.keosha.net 
> 
> 
> 
> -- 
> Neelesh Srinivas Salian
> Customer Operations Engineer
> 
> 
> 



Re: Error building a self contained Spark app

2016-03-04 Thread Chandeep Singh
#3: If your code depends on other projects, you will need to package 
everything together in order to distribute it over a Spark cluster.

In your example below I don’t see much of an advantage in building a package.

> On Mar 5, 2016, at 12:32 AM, Ted Yu  wrote:
> 
> Answers to first two questions are 'yes'
> 
> Not clear on what the 3rd question is asking.
> 
> On Fri, Mar 4, 2016 at 4:28 PM, Mich Talebzadeh  <mailto:mich.talebza...@gmail.com>> wrote:
> Thanks, now all working. Also, select from temp tables is part of sqlContext, 
> not HiveContext.
> 
> This is the final code that works in blue
> 
> 
> Couple of questions if I may
> 
> 1. This works pretty effortlessly in spark-shell. Is this because $CLASSPATH 
> already includes all the needed jars?
> 2. The import section imports the needed classes. So basically, import 
> org.apache.spark.sql.functions._ imports all the methods of the functions object?
> 3. Why should we use sbt to build custom jars from Spark code, as opposed to 
> running the code as a file against spark-shell? Any particular use case for it?
> 
> import org.apache.spark.SparkContext
> import org.apache.spark.SparkConf
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.hive.HiveContext
> import org.apache.spark.sql.types._
> import org.apache.spark.sql.SQLContext
> import org.apache.spark.sql.functions._
> //
> object Sequence {
>   def main(args: Array[String]) {
>   val conf = new 
> SparkConf().setAppName("Sequence").setMaster("local[*]").set("spark.driver.allowMultipleContexts",
>  "true")
>   val sc = new SparkContext(conf)
>   // Note that this should be done only after an instance of 
> org.apache.spark.sql.SQLContext is created. It should be written as:
>   val sqlContext= new org.apache.spark.sql.SQLContext(sc)
>   import sqlContext.implicits._
>   val HiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
>   val a = Seq(("Mich",20), ("Christian", 18), ("James",13), ("Richard",16))
>   // Sort option 1 using tempTable
>   val b = a.toDF("Name","score").registerTempTable("tmp")
>   sqlContext.sql("select Name,score from tmp order by score desc").show
>   // Sort option 2 with FP
>   a.toDF("Name","score").sort(desc("score")).show
>  }
> }
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>  
> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
>  
> 
> On 4 March 2016 at 23:58, Chandeep Singh  <mailto:c...@chandeep.com>> wrote:
> That is because an instance of org.apache.spark.sql.SQLContext doesn’t exist 
> in the current context and is required before you can use any of its implicit 
> methods.
> 
> As Ted mentioned importing org.apache.spark.sql.functions._ will take care of 
> the below error.
> 
> 
>> On Mar 4, 2016, at 11:35 PM, Mich Talebzadeh > <mailto:mich.talebza...@gmail.com>> wrote:
>> 
>> Thanks. It is like a war of attrition. I always thought that you add an import 
>> before the class itself, not within the class? What is the reason for it, 
>> please?
>> 
>> this is my code
>> 
>> import org.apache.spark.SparkContext
>> import org.apache.spark.SparkConf
>> import org.apache.spark.sql.Row
>> import org.apache.spark.sql.hive.HiveContext
>> import org.apache.spark.sql.types._
>> import org.apache.spark.sql.SQLContext
>> //
>> object Sequence {
>>   def main(args: Array[String]) {
>>   val conf = new 
>> SparkConf().setAppName("Sequence").setMaster("local[*]").set("spark.driver.allowMultipleContexts",
>>  "true")
>>   val sc = new SparkContext(conf)
>>   // Note that this should be done only after an instance of 
>> org.apache.spark.sql.SQLContext is created. It should be written as:
>>   val sqlContext= new org.apache.spark.sql.SQLContext(sc)
>>   import sqlContext.implicits._
>>   val HiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
>>   val a = Seq(("Mich",20), ("Christian", 18), ("James",13), ("Richard",16))
>>   // Sort option 1 using tempTable
>>   val b = a.toDF("Name","score").registerTempTable("tmp")
>>   HiveContext.sql("select Name,score from tmp order by score desc").show
>>   // Sort option 2 with FP
>>   a.to

Re: Error building a self contained Spark app

2016-03-04 Thread Chandeep Singh
That is because an instance of org.apache.spark.sql.SQLContext doesn’t exist in 
the current context and is required before you can use any of its implicit 
methods.

As Ted mentioned importing org.apache.spark.sql.functions._ will take care of 
the below error.

> On Mar 4, 2016, at 11:35 PM, Mich Talebzadeh  
> wrote:
> 
> Thanks. It is like a war of attrition. I always thought that you add an import 
> before the class itself, not within the class? What is the reason for it, 
> please?
> 
> this is my code
> 
> import org.apache.spark.SparkContext
> import org.apache.spark.SparkConf
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.hive.HiveContext
> import org.apache.spark.sql.types._
> import org.apache.spark.sql.SQLContext
> //
> object Sequence {
>   def main(args: Array[String]) {
>   val conf = new 
> SparkConf().setAppName("Sequence").setMaster("local[*]").set("spark.driver.allowMultipleContexts",
>  "true")
>   val sc = new SparkContext(conf)
>   // Note that this should be done only after an instance of 
> org.apache.spark.sql.SQLContext is created. It should be written as:
>   val sqlContext= new org.apache.spark.sql.SQLContext(sc)
>   import sqlContext.implicits._
>   val HiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
>   val a = Seq(("Mich",20), ("Christian", 18), ("James",13), ("Richard",16))
>   // Sort option 1 using tempTable
>   val b = a.toDF("Name","score").registerTempTable("tmp")
>   HiveContext.sql("select Name,score from tmp order by score desc").show
>   // Sort option 2 with FP
>   a.toDF("Name","score").sort(desc("score")).show
>  }
> }
> 
> And now the last failure is in
> 
> info]  [SUCCESSFUL ] org.scala-lang#jline;2.10.5!jline.jar (104ms)
> [info] Done updating.
> [info] Compiling 1 Scala source to 
> /home/hduser/dba/bin/scala/Sequence/target/scala-2.10/classes...
> [info] 'compiler-interface' not yet compiled for Scala 2.10.5. Compiling...
> [info]   Compilation completed in 15.779 s
> [error] /home/hduser/dba/bin/scala/Sequence/src/main/scala/Sequence.scala:21: 
> not found: value desc
> [error]   a.toDF("Name","score").sort(desc("score")).show
> [error]   ^
> [error] one error found
> [error] (compile:compileIncremental) Compilation failed
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>  
> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
>  
> 
> On 4 March 2016 at 23:25, Chandeep Singh  <mailto:c...@chandeep.com>> wrote:
> This is what you need:
> 
> val sc = new SparkContext(sparkConf)
> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
> import sqlContext.implicits._
> 
>> On Mar 4, 2016, at 11:03 PM, Mich Talebzadeh > <mailto:mich.talebza...@gmail.com>> wrote:
>> 
>> Hi Ted,
>> 
>> This is my code
>> 
>> import org.apache.spark.SparkConf
>> import org.apache.spark.sql.Row
>> import org.apache.spark.sql.hive.HiveContext
>> import org.apache.spark.sql.types._
>> import org.apache.spark.sql.SQLContext
>> //
>> object Sequence {
>>   def main(args: Array[String]) {
>>   val conf = new 
>> SparkConf().setAppName("Sequence").setMaster("local[*]").set("spark.driver.allowMultipleContexts",
>>  "true")
>>   val sc = new SparkContext(conf)
>>   val sqlContext = new org.apache.spark.sql.SQLContext(sc)
>>   val HiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
>>   val a = Seq(("Mich",20), ("Christian", 18), ("James",13), ("Richard",16))
>>   // Sort option 1 using tempTable
>>   val b = a.toDF("Name","score").registerTempTable("tmp")
>>   sql("select Name,score from tmp order by score desc").show
>>   // Sort option 2 with FP
>>   a.toDF("Name","score").sort(desc("score")).show
>>  }
>> }
>> 
>> And the error I am getting now is
>> 
>> [info] downloading 
>> https://repo1.maven.org/maven2/org/scala-lang/jline/2.10.5/jline-2.10.5.jar 
>> <https://repo1.maven.org/maven2/org/scala-lang/jline/2.10.5/jline-2.10.5.jar>
>>  ...
>> [info]  [SUCCESSFUL ] org.scala-lang#jline;2.10.5!jline.jar (103ms)
>> [info] Don

Re: Error building a self contained Spark app

2016-03-04 Thread Chandeep Singh
This is what you need:

val sc = new SparkContext(sparkConf)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._

> On Mar 4, 2016, at 11:03 PM, Mich Talebzadeh  
> wrote:
> 
> Hi Ted,
> 
> This is my code
> 
> import org.apache.spark.SparkConf
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.hive.HiveContext
> import org.apache.spark.sql.types._
> import org.apache.spark.sql.SQLContext
> //
> object Sequence {
>   def main(args: Array[String]) {
>   val conf = new 
> SparkConf().setAppName("Sequence").setMaster("local[*]").set("spark.driver.allowMultipleContexts",
>  "true")
>   val sc = new SparkContext(conf)
>   val sqlContext = new org.apache.spark.sql.SQLContext(sc)
>   val HiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
>   val a = Seq(("Mich",20), ("Christian", 18), ("James",13), ("Richard",16))
>   // Sort option 1 using tempTable
>   val b = a.toDF("Name","score").registerTempTable("tmp")
>   sql("select Name,score from tmp order by score desc").show
>   // Sort option 2 with FP
>   a.toDF("Name","score").sort(desc("score")).show
>  }
> }
> 
> And the error I am getting now is
> 
> [info] downloading 
> https://repo1.maven.org/maven2/org/scala-lang/jline/2.10.5/jline-2.10.5.jar 
>  
> ...
> [info]  [SUCCESSFUL ] org.scala-lang#jline;2.10.5!jline.jar (103ms)
> [info] Done updating.
> [info] Compiling 1 Scala source to 
> /home/hduser/dba/bin/scala/Sequence/target/scala-2.10/classes...
> [info] 'compiler-interface' not yet compiled for Scala 2.10.5. Compiling...
> [info]   Compilation completed in 12.462 s
> [error] /home/hduser/dba/bin/scala/Sequence/src/main/scala/Sequence.scala:16: 
> value toDF is not a member of Seq[(String, Int)]
> [error]   val b = a.toDF("Name","score").registerTempTable("tmp")
> [error] ^
> [error] /home/hduser/dba/bin/scala/Sequence/src/main/scala/Sequence.scala:17: 
> not found: value sql
> [error]   sql("select Name,score from tmp order by score desc").show
> [error]   ^
> [error] /home/hduser/dba/bin/scala/Sequence/src/main/scala/Sequence.scala:19: 
> value toDF is not a member of Seq[(String, Int)]
> [error]   a.toDF("Name","score").sort(desc("score")).show
> [error] ^
> [error] three errors found
> [error] (compile:compileIncremental) Compilation failed
> [error] Total time: 88 s, completed Mar 4, 2016 11:12:46 PM
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> 
>  
> http://talebzadehmich.wordpress.com 
>  
> 
> On 4 March 2016 at 22:52, Ted Yu  > wrote:
> Can you show your code snippet ?
> Here is an example:
> 
>   val sqlContext = new SQLContext(sc)
>   import sqlContext.implicits._
> 
> On Fri, Mar 4, 2016 at 1:55 PM, Mich Talebzadeh  > wrote:
> Hi Ted,
> 
>  I am getting the following error after adding that import
> 
> [error] /home/hduser/dba/bin/scala/Sequence/src/main/scala/Sequence.scala:5: 
> not found: object sqlContext
> [error] import sqlContext.implicits._
> [error]^
> [error] /home/hduser/dba/bin/scala/Sequence/src/main/scala/Sequence.scala:15: 
> value toDF is not a member of Seq[(String, Int)]
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> 
>  
> http://talebzadehmich.wordpress.com 
>  
> 
> On 4 March 2016 at 21:39, Ted Yu  > wrote:
> Can you add the following into your code ?
>  import sqlContext.implicits._
> 
> On Fri, Mar 4, 2016 at 1:14 PM, Mich Talebzadeh  > wrote:
> Hi,
> 
> I have a simple Scala program as below
> 
> import org.apache.spark.SparkContext
> import org.apache.spark.SparkContext._
> import org.apache.spark.SparkConf
> import org.apache.spark.sql.SQLContext
> object Sequence {
>   def main(args: Array[String]) {
>   val conf = new SparkConf().setAppName("Sequence")
>   val sc = new SparkContext(conf)
>   val sqlContext = new org.apache.spark.sql.SQLContext(sc)
>   val HiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
>   val a = Seq(("Mich",20), ("Christian", 18), ("James",13), ("Richard",16))
>   // Sort option 1 using tempTable
>   val b = a.toDF("Name","score").registerTempTable("tmp")
>   sql("select Name,score from tmp order by score desc").show
>   // Sort option 2 with FP
>   a.toDF("Name","score").sort(desc("score")).show
>  }
> }
> 
> I build this using sbt tool as below
> 
>  cat sequence.sbt
> name := "Sequence"
> version := "1.0"
> scalaVersion := "2.10.5"
> libraryDependencies += "org.apache.spark" %% 

Re: output the datas(txt)

2016-02-28 Thread Chandeep Singh
Here is what you can do.

// Recreating your RDD
val a = Array(Array(1, 2, 3), Array(2, 3, 4), Array(3, 4, 5), Array(6, 7, 8))
val b = sc.parallelize(a)

val c = b.map(x => (x(0) + " " + x(1) + " " + x(2)))

// Collect to the driver to check the result 
c.collect()
—> res3: Array[String] = Array(1 2 3, 2 3 4, 3 4 5, 6 7 8)

c.saveAsTextFile("test”)
—>
[csingh ~]$ hadoop fs -cat test/*
1 2 3
2 3 4
3 4 5
6 7 8
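
If the inner arrays can have an arbitrary number of elements, mkString builds the same line without indexing each element (a small variation on the map above):

val c = b.map(_.mkString(" "))   // works for rows of any length
c.saveAsTextFile("test")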

> On Feb 28, 2016, at 1:20 AM, Bonsen  wrote:
> 
> I get results from RDDs,
> like :
> Array(Array(1,2,3),Array(2,3,4),Array(3,4,6))
> how can I output them to 1.txt
> like :
> 1 2 3
> 2 3 4
> 3 4 6
> 
> 
> 
> 
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/output-the-datas-txt-tp26350.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
> 



Re: Percentile calculation in spark 1.6

2016-02-23 Thread Chandeep Singh
This should help - 
http://stackoverflow.com/questions/28805602/how-to-compute-percentiles-in-apache-spark
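
For reference, one way to do it on a plain RDD in 1.6, along the lines of the linked answer (a sketch: sort, index by rank, and look the rank up):

// exact percentile of an RDD[Double]; p is in [0, 1]
def percentile(data: org.apache.spark.rdd.RDD[Double], p: Double): Double = {
  val indexed = data.sortBy(identity).zipWithIndex().map { case (v, i) => (i, v) }
  val n = indexed.count()
  val rank = math.ceil(p * (n - 1)).toLong   // nearest-rank style position
  indexed.lookup(rank).head
}

// hypothetical usage:
// percentile(sc.parallelize(Seq(1.0, 2.0, 3.0, 4.0, 5.0)), 0.95)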
 

> On Feb 23, 2016, at 10:08 AM, Arunkumar Pillai  
> wrote:
> 
> How to calculate percentile in spark 1.6 ?
> 
> 
> -- 
> Thanks and Regards
> Arun



Re: Read files dynamically having different schema under one parent directory + scala + Spark 1.5.2

2016-02-20 Thread Chandeep Singh
Here is how you can list all HDFS directories for a given path.

val hadoopConf = new org.apache.hadoop.conf.Configuration()
val hdfsConn = org.apache.hadoop.fs.FileSystem.get(new 
java.net.URI("hdfs://:8020"), hadoopConf)
val c = hdfsConn.listStatus(new org.apache.hadoop.fs.Path("/user/csingh/"))
c.foreach(x => println(x.getPath))

Output:
hdfs:///user/csingh/.Trash
hdfs:///user/csingh/.sparkStaging
hdfs:///user/csingh/.staging
hdfs:///user/csingh/test1
hdfs:///user/csingh/test2
hdfs:///user/csingh/tmp
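
Combining that listing with the spark-csv loader, a rough sketch of the "one DataFrame per sub-directory" idea (assumes a SQLContext named sqlContext is in scope, the spark-csv package is on the classpath, and the parent path is the one from the question):

val hadoopConf = new org.apache.hadoop.conf.Configuration()
val fs = org.apache.hadoop.fs.FileSystem.get(hadoopConf)

// keep only the sub-directories of the parent directory
val subDirs = fs.listStatus(new org.apache.hadoop.fs.Path("hdfs:///TestDirectory/"))
  .filter(_.isDirectory)
  .map(_.getPath.toString)

// one DataFrame per sub-directory, keyed by its path
val dfBySubDir = subDirs.map { dir =>
  dir -> sqlContext.read
    .format("com.databricks.spark.csv")
    .option("inferSchema", "true")
    .load(dir)
}.toMap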


> On Feb 20, 2016, at 2:37 PM, Divya Gehlot  wrote:
> 
> Hi,
> @Umesh: Your understanding is partially correct as per my requirement.
> The idea I am trying to implement is as follows.
> Steps I am trying to follow 
> (not sure how feasible it is; I am a newbie to Spark and Scala):
> 1.List all the files under parent directory 
>   hdfs :///Testdirectory/
> As list 
> For example : val listsubdirs =(subdir1,subdir2...subdir.n)
> Iterate through this list 
> for(subdir <-listsubdirs){
> val df ="df"+subdir
> df= read it using spark csv package using custom schema
> 
> }
> Will get dataframes equal to subdirs
> 
> Now I got stuck in first step itself .
> How do I list directories and put it in list ?
> 
> Hope you understood my issue now.
> Thanks,
> Divya 
> On Feb 19, 2016 6:54 PM, "UMESH CHAUDHARY"  > wrote:
> If I understood correctly, you can have many sub-dirs under 
> hdfs:///TestDirectory and and you need to attach a schema to all part files 
> in a sub-dir. 
> 
> 1) I am assuming that you know the sub-dirs names :
> 
> For that, you need to list all sub-dirs inside hdfs:///TestDirectory 
> using Scala, iterate over sub-dirs 
> foreach sub-dir in the list 
> read the partfiles , identify and attach schema respective to that 
> sub-directory. 
> 
> 2) If you don't know the sub-directory names:
> You need to store schema somewhere inside that sub-directory and read it 
> in iteration.
> 
> On Fri, Feb 19, 2016 at 3:44 PM, Divya Gehlot  > wrote:
> Hi,
> I have a use case ,where I have one parent directory
> 
> File stucture looks like 
> hdfs:///TestDirectory/spark1/part files( created by some spark job )
> hdfs:///TestDirectory/spark2/ part files (created by some spark job )
> 
> spark1 and spark 2 has different schema 
> 
> like spark 1  part files schema
> carname model year
> 
> Spark2 part files schema
> carowner city  carcost
> 
> 
> As these spark 1 and spark2 directory gets created dynamically 
> can have spark3 directory with different schema
> 
> M requirement is to read the parent directory and list sub drectory 
> and create dataframe for each subdirectory
> 
> I am not able to get how can I list subdirectory under parent directory and 
> dynamically create dataframes.
> 
> Thanks,
> Divya 
> 
> 
> 
> 
> 



Re: Checking for null values when mapping

2016-02-20 Thread Chandeep Singh
Ah. Ok.

> On Feb 20, 2016, at 2:31 PM, Mich Talebzadeh  wrote:
> 
> Yes I did that as well but no joy. My shell does it for windows files 
> automatically
>  
> Thanks, 
>  
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>  
> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
>  
> NOTE: The information in this email is proprietary and confidential. This 
> message is for the designated recipient only, if you are not the intended 
> recipient, you should destroy it immediately. Any information in this message 
> shall not be understood as given or endorsed by Peridale Technology Ltd, its 
> subsidiaries or their employees, unless expressly so stated. It is the 
> responsibility of the recipient to ensure that this email is virus free, 
> therefore neither Peridale Technology Ltd, its subsidiaries nor their 
> employees accept any responsibility.
>  
>  
> From: Chandeep Singh [mailto:c...@chandeep.com] 
> Sent: 20 February 2016 14:27
> To: Mich Talebzadeh 
> Cc: user @spark 
> Subject: Re: Checking for null values when mapping
>  
> Also, have you looked into Dos2Unix (http://dos2unix.sourceforge.net/ 
> <http://dos2unix.sourceforge.net/>)
>  
> Has helped me in the past to deal with special characters while using windows 
> based CSV’s in Linux. (Might not be the solution here.. Just an FYI :))
>  
>> On Feb 20, 2016, at 2:17 PM, Chandeep Singh > <mailto:c...@chandeep.com>> wrote:
>>  
>> Understood. In that case Ted’s suggestion to check the length should solve 
>> the problem.
>>  
>>> On Feb 20, 2016, at 2:09 PM, Mich Talebzadeh >> <mailto:m...@peridale.co.uk>> wrote:
>>>  
>>> Hi,
>>>  
>>> That is a good question.
>>>  
>>> When data is exported from CSV to Linux, any character that cannot be 
>>> transformed is replaced by ?. That question mark is not actually the 
>>> expected “?” :)
>>>  
>>> So the only way I can get rid of it is by dropping the first character 
>>> using substring(1). I checked I did the same in Hive SQL.
>>>  
>>> The actual field in CSV is “£2,500.00” that translates into “?2,500.00”
>>>  
>>> HTH
>>>  
>>>  
>>> Dr Mich Talebzadeh
>>>  
>>> LinkedIn  
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>  
>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>>>  
>>> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
>>>  
>>> NOTE: The information in this email is proprietary and confidential. This 
>>> message is for the designated recipient only, if you are not the intended 
>>> recipient, you should destroy it immediately. Any information in this 
>>> message shall not be understood as given or endorsed by Peridale Technology 
>>> Ltd, its subsidiaries or their employees, unless expressly so stated. It is 
>>> the responsibility of the recipient to ensure that this email is virus 
>>> free, therefore neither Peridale Technology Ltd, its subsidiaries nor their 
>>> employees accept any responsibility.
>>>  
>>>  
>>> From: Chandeep Singh [mailto:c...@chandeep.com <mailto:c...@chandeep.com>] 
>>> Sent: 20 February 2016 13:47
>>> To: Mich Talebzadeh mailto:m...@peridale.co.uk>>
>>> Cc: user @spark mailto:user@spark.apache.org>>
>>> Subject: Re: Checking for null values when mapping
>>>  
>>> Looks like you’re using substring just to get rid of the ‘?’. Why not use 
>>> replace for that as well? And then you wouldn’t run into issues with index 
>>> out of bound.
>>>  
>>> val a = "?1,187.50"  
>>> val b = ""
>>>  
>>> println(a.substring(1).replace(",", ""))
>>> —> 1187.50
>>>  
>>> println(a.replace("?", "").replace(",", ""))
>>> —> 1187.50
>>>  
>>> println(b.replace("?", "").replace(",", ""))
>>> —> No error / output since both '?' and ',' don't exist.
>>>  
>>>  
>>>> On Feb 20, 2016, at 8:24 AM, Mich Talebzadeh >>> <mailto:m...@peridale.co.uk>> wrote:
>>>>  
>>>>  
>>&g

Re: Checking for null values when mapping

2016-02-20 Thread Chandeep Singh
Also, have you looked into Dos2Unix (http://dos2unix.sourceforge.net/ 
<http://dos2unix.sourceforge.net/>)

Has helped me in the past to deal with special characters while using windows 
based CSV’s in Linux. (Might not be the solution here.. Just an FYI :))

> On Feb 20, 2016, at 2:17 PM, Chandeep Singh  wrote:
> 
> Understood. In that case Ted’s suggestion to check the length should solve 
> the problem.
> 
>> On Feb 20, 2016, at 2:09 PM, Mich Talebzadeh > <mailto:m...@peridale.co.uk>> wrote:
>> 
>> Hi,
>>  
>> That is a good question.
>>  
>> When data is exported from CSV to Linux, any character that cannot be 
>> transformed is replaced by ?. That question mark is not actually the 
>> expected “?” :)
>>  
>> So the only way I can get rid of it is by dropping the first character using 
>> substring(1). I checked I did the same in Hive SQL.
>>  
>> The actual field in CSV is “£2,500.00” that translates into “?2,500.00”
>>  
>> HTH
>>  
>>  
>> Dr Mich Talebzadeh
>>  
>> LinkedIn  
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>  
>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>>  
>> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
>>  
>> NOTE: The information in this email is proprietary and confidential. This 
>> message is for the designated recipient only, if you are not the intended 
>> recipient, you should destroy it immediately. Any information in this 
>> message shall not be understood as given or endorsed by Peridale Technology 
>> Ltd, its subsidiaries or their employees, unless expressly so stated. It is 
>> the responsibility of the recipient to ensure that this email is virus free, 
>> therefore neither Peridale Technology Ltd, its subsidiaries nor their 
>> employees accept any responsibility.
>>  
>>  
>> From: Chandeep Singh [mailto:c...@chandeep.com <mailto:c...@chandeep.com>] 
>> Sent: 20 February 2016 13:47
>> To: Mich Talebzadeh mailto:m...@peridale.co.uk>>
>> Cc: user @spark mailto:user@spark.apache.org>>
>> Subject: Re: Checking for null values when mapping
>>  
>> Looks like you’re using substring just to get rid of the ‘?’. Why not use 
>> replace for that as well? And then you wouldn’t run into issues with index 
>> out of bound.
>>  
>> val a = "?1,187.50"  
>> val b = ""
>>  
>> println(a.substring(1).replace(",", ""))
>> —> 1187.50
>>  
>> println(a.replace("?", "").replace(",", ""))
>> —> 1187.50
>>  
>> println(b.replace("?", "").replace(",", ""))
>> —> No error / output since both '?' and ',' don't exist.
>>  
>>  
>>> On Feb 20, 2016, at 8:24 AM, Mich Talebzadeh >> <mailto:m...@peridale.co.uk>> wrote:
>>>  
>>>  
>>> I have a DF like below reading a csv file
>>>  
>>>  
>>> val df = 
>>> HiveContext.read.format("com.databricks.spark.csv").option("inferSchema", 
>>> "true").option("header", "true").load("/data/stg/table2")
>>>  
>>> val a = df.map(x => (x.getString(0), x.getString(1), 
>>> x.getString(2).substring(1).replace(",", 
>>> "").toDouble,x.getString(3).substring(1).replace(",", "").toDouble, 
>>> x.getString(4).substring(1).replace(",", "").toDouble))
>>>  
>>>  
>>> For most rows I am reading from csv file the above mapping works fine. 
>>> However, at the bottom of csv there are couple of empty columns as below
>>>  
>>> [421,02/10/2015,?1,187.50,?237.50,?1,425.00]
>>> []
>>> [Net income,,?182,531.25,?14,606.25,?197,137.50]
>>> []
>>> [year 2014,,?113,500.00,?0.00,?113,500.00]
>>> [Year 2015,,?69,031.25,?14,606.25,?83,637.50]
>>>  
>>> However, I get 
>>>  
>>> a.collect.foreach(println)
>>> 16/02/20 08:31:53 ERROR Executor: Exception in task 0.0 in stage 123.0 (TID 
>>> 161)
>>> java.lang.StringIndexOutOfBoundsException: String index out of range: -1
>>>  
>>> I suspect the cause is substring operation  say x.getString(2).substring(1) 
>>> on empty values that according to web will throw this type of error
>>>  
>>>  
>>> The easiest solutio

Re: Checking for null values when mapping

2016-02-20 Thread Chandeep Singh
Understood. In that case Ted’s suggestion to check the length should solve the 
problem.
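
A minimal sketch of that idea, reusing the map from earlier in the thread (the column positions and the 0.0 default for empty fields are just assumptions):

val a = df.map { x =>
  // guard against null / empty fields before stripping the leading character
  def money(s: String): Double =
    if (s != null && s.length > 1) s.substring(1).replace(",", "").toDouble else 0.0
  (x.getString(0), x.getString(1), money(x.getString(2)), money(x.getString(3)), money(x.getString(4)))
}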

> On Feb 20, 2016, at 2:09 PM, Mich Talebzadeh  wrote:
> 
> Hi,
>  
> That is a good question.
>  
> When data is exported from CSV to Linux, any character that cannot be 
> transformed is replaced by ?. That question mark is not actually the expected 
> “?” :)
>  
> So the only way I can get rid of it is by dropping the first character using 
> substring(1). I checked I did the same in Hive SQL.
>  
> The actual field in CSV is “£2,500.00” that translates into “?2,500.00”
>  
> HTH
>  
>  
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>  
> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
>  
> NOTE: The information in this email is proprietary and confidential. This 
> message is for the designated recipient only, if you are not the intended 
> recipient, you should destroy it immediately. Any information in this message 
> shall not be understood as given or endorsed by Peridale Technology Ltd, its 
> subsidiaries or their employees, unless expressly so stated. It is the 
> responsibility of the recipient to ensure that this email is virus free, 
> therefore neither Peridale Technology Ltd, its subsidiaries nor their 
> employees accept any responsibility.
>  
>  
> From: Chandeep Singh [mailto:c...@chandeep.com] 
> Sent: 20 February 2016 13:47
> To: Mich Talebzadeh 
> Cc: user @spark 
> Subject: Re: Checking for null values when mapping
>  
> Looks like you’re using substring just to get rid of the ‘?’. Why not use 
> replace for that as well? And then you wouldn’t run into issues with index 
> out of bound.
>  
> val a = "?1,187.50"  
> val b = ""
>  
> println(a.substring(1).replace(",", ""))
> —> 1187.50
>  
> println(a.replace("?", "").replace(",", ""))
> —> 1187.50
>  
> println(b.replace("?", "").replace(",", ""))
> —> No error / output since both '?' and ',' don't exist.
>  
>  
>> On Feb 20, 2016, at 8:24 AM, Mich Talebzadeh > <mailto:m...@peridale.co.uk>> wrote:
>>  
>>  
>> I have a DF like below reading a csv file
>>  
>>  
>> val df = 
>> HiveContext.read.format("com.databricks.spark.csv").option("inferSchema", 
>> "true").option("header", "true").load("/data/stg/table2")
>>  
>> val a = df.map(x => (x.getString(0), x.getString(1), 
>> x.getString(2).substring(1).replace(",", 
>> "").toDouble,x.getString(3).substring(1).replace(",", "").toDouble, 
>> x.getString(4).substring(1).replace(",", "").toDouble))
>>  
>>  
>> For most rows I am reading from csv file the above mapping works fine. 
>> However, at the bottom of csv there are couple of empty columns as below
>>  
>> [421,02/10/2015,?1,187.50,?237.50,?1,425.00]
>> []
>> [Net income,,?182,531.25,?14,606.25,?197,137.50]
>> []
>> [year 2014,,?113,500.00,?0.00,?113,500.00]
>> [Year 2015,,?69,031.25,?14,606.25,?83,637.50]
>>  
>> However, I get 
>>  
>> a.collect.foreach(println)
>> 16/02/20 08:31:53 ERROR Executor: Exception in task 0.0 in stage 123.0 (TID 
>> 161)
>> java.lang.StringIndexOutOfBoundsException: String index out of range: -1
>>  
>> I suspect the cause is substring operation  say x.getString(2).substring(1) 
>> on empty values that according to web will throw this type of error
>>  
>>  
>> The easiest solution seems to be to check whether x above is not null and do 
>> the substring operation. Can this be done without using a UDF?
>>  
>> Thanks
>>  
>> Dr Mich Talebzadeh
>>  
>> LinkedIn  
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>  
>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>>  
>> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
>>  
>> NOTE: The information in this email is proprietary and confidential. This 
>> message is for the designated recipient only, if you are not the intended 
>> recipient, you should destroy it immediately. Any information in this 
>> message shall not be understood as given or endorsed by Peridale Technology 
>> Ltd, its subsidiaries or their employees, unless expressly so stated. It is 
>> the responsibility of the recipient to ensure that this email is virus free, 
>> therefore neither Peridale Technology Ltd, its subsidiaries nor their 
>> employees accept any responsibility.



Re: Checking for null values when mapping

2016-02-20 Thread Chandeep Singh
Looks like you’re using substring just to get rid of the ‘?’. Why not use 
replace for that as well? And then you wouldn’t run into issues with index out 
of bound.

val a = "?1,187.50"  
val b = ""

println(a.substring(1).replace(",", ""))
—> 1187.50

println(a.replace("?", "").replace(",", ""))
—> 1187.50

println(b.replace("?", "").replace(",", ""))
—> No error / output since both '?' and ',' don't exist.


> On Feb 20, 2016, at 8:24 AM, Mich Talebzadeh  wrote:
> 
>  
> I have a DF like below reading a csv file
>  
>  
> val df = 
> HiveContext.read.format("com.databricks.spark.csv").option("inferSchema", 
> "true").option("header", "true").load("/data/stg/table2")
>  
> val a = df.map(x => (x.getString(0), x.getString(1), 
> x.getString(2).substring(1).replace(",", 
> "").toDouble,x.getString(3).substring(1).replace(",", "").toDouble, 
> x.getString(4).substring(1).replace(",", "").toDouble))
>  
>  
> For most rows I am reading from csv file the above mapping works fine. 
> However, at the bottom of csv there are couple of empty columns as below
>  
> [421,02/10/2015,?1,187.50,?237.50,?1,425.00]
> []
> [Net income,,?182,531.25,?14,606.25,?197,137.50]
> []
> [year 2014,,?113,500.00,?0.00,?113,500.00]
> [Year 2015,,?69,031.25,?14,606.25,?83,637.50]
>  
> However, I get 
>  
> a.collect.foreach(println)
> 16/02/20 08:31:53 ERROR Executor: Exception in task 0.0 in stage 123.0 (TID 
> 161)
> java.lang.StringIndexOutOfBoundsException: String index out of range: -1
>  
> I suspect the cause is substring operation  say x.getString(2).substring(1) 
> on empty values that according to web will throw this type of error
>  
>  
> The easiest solution seems to be to check whether x above is not null and do 
> the substring operation. Can this be done without using a UDF?
>  
> Thanks
>  
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> 
>  
> http://talebzadehmich.wordpress.com 
>  
> NOTE: The information in this email is proprietary and confidential. This 
> message is for the designated recipient only, if you are not the intended 
> recipient, you should destroy it immediately. Any information in this message 
> shall not be understood as given or endorsed by Peridale Technology Ltd, its 
> subsidiaries or their employees, unless expressly so stated. It is the 
> responsibility of the recipient to ensure that this email is virus free, 
> therefore neither Peridale Technology Ltd, its subsidiaries nor their 
> employees accept any responsibility.



Re: Hive REGEXP_REPLACE use or equivalent in Spark

2016-02-19 Thread Chandeep Singh
You might be better off using the CSV loader in this case. 
https://github.com/databricks/spark-csv 


Input:
[csingh ~]$ hadoop fs -cat test.csv
360,10/02/2014,"?2,500.00",?0.00,"?2,500.00"

and here is a quick and dirty way to resolve your issue:

val df = 
sqlContext.read.format("com.databricks.spark.csv").option("inferSchema", 
"true").load("test.csv")
—> df: org.apache.spark.sql.DataFrame = [C0: int, C1: string, C2: string, C3: 
string, C4: string]

df.first()
—> res0: org.apache.spark.sql.Row = [360,10/02/2014,?2,500.00,?0.00,?2,500.00]

val a = df.map(x => (x.getInt(0), x.getString(1), x.getString(2).replace("?", 
"").replace(",", ""), x.getString(3).replace("?", ""), 
x.getString(4).replace("?", "").replace(",", "")))
—> a: org.apache.spark.rdd.RDD[(Int, String, String, String, String)] = 
MapPartitionsRDD[17] at map at :21

a.collect()
—> res1: Array[(Int, String, String, String, String)] = 
Array((360,10/02/2014,2500.00,0.00,2500.00))
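
And since the original question was about REGEXP_REPLACE specifically, the same pattern from the Hive query also works per field with String.replaceAll (column positions as in the snippet above):

"?2,500.00".replaceAll("[^\\d.]", "")   // -> 2500.00

val b = df.map(x => (x.getInt(0), x.getString(1),
  x.getString(2).replaceAll("[^\\d.]", ""),
  x.getString(3).replaceAll("[^\\d.]", ""),
  x.getString(4).replaceAll("[^\\d.]", "")))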

> On Feb 19, 2016, at 9:06 AM, Mich Talebzadeh  wrote:
> 
> Ok
>  
> I have created a one liner csv file as follows:
>  
> cat testme.csv
> 360,10/02/2014,"?2,500.00",?0.00,"?2,500.00"
>  
> I use the following in Spark to split it
>  
> csv=sc.textFile("/data/incoming/testme.csv")
> csv.map(_.split(",")).first
> res159: Array[String] = Array(360, 10/02/2014, "?2, 500.00", ?0.00, "?2, 
> 500.00")
>  
> That comes back with an array
>  
> Now all I want is to get rid of “?” and “,” in above. The problem is I have a 
> currency field “?2,500.00” that has got an additional “,” as well that messes 
> up things
>  
> replaceAll() does not work
>  
> Any other alternatives?
>  
> Thanks,
>  
>  
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> 
>  
> http://talebzadehmich.wordpress.com 
>  
> NOTE: The information in this email is proprietary and confidential. This 
> message is for the designated recipient only, if you are not the intended 
> recipient, you should destroy it immediately. Any information in this message 
> shall not be understood as given or endorsed by Peridale Technology Ltd, its 
> subsidiaries or their employees, unless expressly so stated. It is the 
> responsibility of the recipient to ensure that this email is virus free, 
> therefore neither Peridale Technology Ltd, its subsidiaries nor their 
> employees accept any responsibility.
>  
>  
> From: Andrew Ehrlich [mailto:and...@aehrlich.com] 
> Sent: 19 February 2016 01:22
> To: Mich Talebzadeh 
> Cc: User 
> Subject: Re: Hive REGEXP_REPLACE use or equivalent in Spark
>  
> Use the scala method .split(",") to split the string into a collection of 
> strings, and try using .replaceAll() on the field with the "?" to remove it.
>  
> On Thu, Feb 18, 2016 at 2:09 PM, Mich Talebzadeh  > wrote:
>> Hi,
>> 
>> What is the equivalent of this Hive statement in Spark
>> 
>>  
>> 
>> select "?2,500.00", REGEXP_REPLACE("?2,500.00",'[^\\d\\.]','');
>> ++--+--+
>> |_c0 |   _c1|
>> ++--+--+
>> | ?2,500.00  | 2500.00  |
>> ++--+--+
>> 
>> Basically I want to get rid of "?" and "," in the csv file
>> 
>>  
>> 
>> The full csv line is
>> 
>>  
>> 
>> scala> csv2.first
>> res94: String = 360,10/02/2014,"?2,500.00",?0.00,"?2,500.00"
>> 
>> I want to transform that string into 5 columns and use "," as the split
>> 
>> Thanks,
>> 
>> Dr Mich Talebzadeh
>>  
>> LinkedIn  
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>  
>> 
>>  
>> http://talebzadehmich.wordpress.com 
>>  
>> NOTE: The information in this email is proprietary and confidential. This 
>> message is for the designated recipient only, if you are not the intended 
>> recipient, you should destroy it immediately. Any information in this 
>> message shall not be understood as given or endorsed by Peridale Technology 
>> Ltd, its subsidiaries or their employees, unless expressly so stated. It is 
>> the responsibility of the recipient to ensure that this email is virus free, 
>> therefore neither Peridale Technology Ltd, its subsidiaries nor their 
>> employees accept any responsibility.



Re: SparkOnHBase : Which version of Spark its available

2016-02-17 Thread Chandeep Singh
HBase-Spark module was added in 1.3

https://issues.apache.org/jira/browse/HBASE-13992 


http://blog.cloudera.com/blog/2015/08/apache-spark-comes-to-apache-hbase-with-hbase-spark-module/
 

http://blog.cloudera.com/blog/2014/12/new-in-cloudera-labs-sparkonhbase/ 


> On Feb 17, 2016, at 9:44 AM, Divya Gehlot  wrote:
> 
> Hi,
> 
> SparkonHBase is integrated with which version of Spark and HBase ?
> 
> 
> 
> 
> 
> Thanks,
> Divya 



Re: Use case for RDD and Data Frame

2016-02-16 Thread Chandeep Singh
Ah. My bad! :)

> On Feb 16, 2016, at 6:24 PM, Mich Talebzadeh  wrote:
> 
> Thanks Chandeep.
>  
> Andy Grove, the author, pointed to that article in an earlier thread :)
>  
>  
>  
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>
>  
> http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/>
>  
>  
>  
> From: Chandeep Singh [mailto:c...@chandeep.com] 
> Sent: 16 February 2016 18:17
> To: Mich Talebzadeh 
> Cc: Ashok Kumar ; User 
> Subject: Re: Use case for RDD and Data Frame
>  
> Here is another interesting post.
>  
> http://www.kdnuggets.com/2016/02/apache-spark-rdd-dataframe-dataset.html?utm_content=buffer31ce5&utm_medium=social&utm_source=twitter.com&utm_campaign=buffer
>  
> <http://www.kdnuggets.com/2016/02/apache-spark-rdd-dataframe-dataset.html?utm_content=buffer31ce5&utm_medium=social&utm_source=twitter.com&utm_campaign=buffer>
>  
>> On Feb 16, 2016, at 6:01 PM, Mich Talebzadeh > <mailto:m...@peridale.co.uk>> wrote:
>>  
>> Hi,
>>  
>> A Resilient Distributed Dataset (RDD) is a collection of data distributed 
>> across the nodes of a cluster. It is essentially raw data with little 
>> optimization applied to it. Remember that data is not of much value until it 
>> is turned into information.
>>  
>> A DataFrame, on the other hand, is equivalent to a table in an RDBMS such as 
>> Oracle or Sybase. In other words, it is a two-dimensional, array-like 
>> structure in which each column contains measurements on one variable and 
>> each row contains one case.
>>  
>> So a DataFrame, by definition, has additional metadata due to its tabular 
>> format, which allows the Spark optimizer, Catalyst, to take advantage of 
>> that format for certain optimizations. Even after so many years, the 
>> relational model is arguably the most elegant model known, and it is used 
>> and emulated everywhere.
>>  
>> Much like a table in RDBMS, a DataFrame keeps track of the schema and 
>> supports various relational operations that lead to more optimized 
>> execution. Essentially each DataFrame object represents a logical plan but 
>> because of their "lazy" nature no execution occurs until the user calls a 
>> specific "output operation". This is very important to remember. You can go 
>> from a DataFrame to an RDD via its rdd method. You can go from an RDD to a 
>> DataFrame (if the RDD is in a tabular format) via the toDF method.
>>  
>> In general it is recommended to use a DataFrame where possible due to the 
>> built in query optimization.
>>  
>> For those familiar with SQL a DataFrame can be conveniently registered as a 
>> temporary table and SQL operations can be performed on it.
>>  
>> Case in point I am looking for all my replication server log files 
>> compressed and stored in an HDFS directory for error on a specific connection
>>  
>> //create an RDD
>> val rdd = sc.textFile("/test/REP_LOG.gz")
>> //convert it to Data Frame
>> val df = rdd.toDF("line")
>> //register the line as a temporary table
>> df.registerTempTable("t")
>> println("\n Search for ERROR plus another word in table t\n")
>> sql("select * from t WHERE line like '%ERROR%' and line like 
>> '%hiveserver2.asehadoop%'").collect().foreach(println)
>>  
>> Alternatively you can just use method calls on the DataFrame itself to 
>> filter out the word
>>  
>> df.filter(col("line").like("%ERROR%")).collect.foreach(println)
>>  
>> HTH,
>>  
>> Dr Mich Talebzadeh
>>  
>> LinkedIn  
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>  
>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6A

Re: Frustration over Spark and Jackson

2016-02-16 Thread Chandeep Singh
Shading worked pretty well for me when I ran into an issue similar to yours. 
POM is all you need to change.
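
For a Jackson version clash like the one below, the relevant piece is usually 
a <relocations> section inside the shade plugin configuration: shading 
rewrites the package references inside the final jar at build time, so the 
common source code itself normally does not need to change. A sketch of what 
that section could look like (the shaded package prefix is just an example):

<relocations>
  <relocation>
    <pattern>com.fasterxml.jackson</pattern>
    <shadedPattern>shaded.com.fasterxml.jackson</shadedPattern>
  </relocation>
</relocations>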


<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <version>1.6</version>
  <executions>
    <execution>
      <phase>package</phase>
      <goals>
        <goal>shade</goal>
      </goals>
      <configuration>
        <filters>
          <filter>
            <artifact>*:*</artifact>
            <excludes>
              <exclude>META-INF/*.SF</exclude>
              <exclude>META-INF/*.DSA</exclude>
              <exclude>META-INF/*.RSA</exclude>
            </excludes>
          </filter>
        </filters>
        <transformers>
          <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
            <mainClass>com.group.id.Launcher1</mainClass>
          </transformer>
        </transformers>
      </configuration>
    </execution>
  </executions>
</plugin>

> On Feb 16, 2016, at 5:10 PM, Sean Owen  wrote:
> 
> Shading is the answer. It should be transparent to you though if you
> only apply it at the module where you create the deployable assembly
> JAR.
> 
> On Tue, Feb 16, 2016 at 5:08 PM, Martin Skøtt  wrote:
>> Hi,
>> 
>> I recently started experimenting with Spark Streaming for ingesting and
>> enriching content from a Kafka stream. Being new to Spark I expected a bit
>> of a learning curve, but not with something as simple a using JSON data!
>> 
>> I have a JAR with common classes used across a number of Java projects which
>> I would also like to use in my Spark projects, but it uses a version of
>> Jackson which is newer than the one packaged with Spark - I can't (and
>> won't) downgrade to the older version in Spark. Any suggestions on how to
>> solve this?
>> 
>> I have looked at using the shade plugin to rename my version of Jackson, but
>> that would require me to change my common code which I would like to avoid.
>> 
>> 
>> --
>> Kind regards
>> Martin
>> 
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
> 


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Use case for RDD and Data Frame

2016-02-16 Thread Chandeep Singh
Here is another interesting post.

http://www.kdnuggets.com/2016/02/apache-spark-rdd-dataframe-dataset.html?utm_content=buffer31ce5&utm_medium=social&utm_source=twitter.com&utm_campaign=buffer
 


> On Feb 16, 2016, at 6:01 PM, Mich Talebzadeh  wrote:
> 
> Hi,
>  
> A Resilient Distributed Dataset (RDD) is a collection of data distributed 
> across the nodes of a cluster. It is essentially raw data with little 
> optimization applied to it. Remember that data is not of much value until it 
> is turned into information.
>  
> A DataFrame, on the other hand, is equivalent to a table in an RDBMS such as 
> Oracle or Sybase. In other words, it is a two-dimensional, array-like 
> structure in which each column contains measurements on one variable and 
> each row contains one case.
>  
> So a DataFrame, by definition, has additional metadata due to its tabular 
> format, which allows the Spark optimizer, Catalyst, to take advantage of 
> that format for certain optimizations. Even after so many years, the 
> relational model is arguably the most elegant model known, and it is used 
> and emulated everywhere.
>  
> Much like a table in RDBMS, a DataFrame keeps track of the schema and 
> supports various relational operations that lead to more optimized execution. 
> Essentially each DataFrame object represents a logical plan but because of 
> their "lazy" nature no execution occurs until the user calls a specific 
> "output operation". This is very important to remember. You can go from a 
> DataFrame to an RDD via its rdd method. You can go from an RDD to a DataFrame 
> (if the RDD is in a tabular format) via the toDF method.
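> 
> A minimal illustration of those two conversions in the spark-shell (the 
> column name is just an example):
> 
> val rdd  = sc.parallelize(Seq("line one", "line two"))  // RDD[String]
> val df   = rdd.toDF("line")                             // RDD -> DataFrame
> val back = df.rdd                                       // DataFrame -> RDD[Row]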
>  
> In general it is recommended to use a DataFrame where possible due to the 
> built in query optimization.
>  
> For those familiar with SQL a DataFrame can be conveniently registered as a 
> temporary table and SQL operations can be performed on it.
>  
> Case in point I am looking for all my replication server log files compressed 
> and stored in an HDFS directory for error on a specific connection
>  
> //create an RDD
> val rdd = sc.textFile("/test/REP_LOG.gz")
> //convert it to Data Frame
> val df = rdd.toDF("line")
> //register the line as a temporary table
> df.registerTempTable("t")
> println("\n Search for ERROR plus another word in table t\n")
> sql("select * from t WHERE line like '%ERROR%' and line like 
> '%hiveserver2.asehadoop%'").collect().foreach(println)
>  
> Alternatively you can just use method calls on the DataFrame itself to filter 
> out the word
>  
> df.filter(col("line").like("%ERROR%")).collect.foreach(println)
>  
> HTH,
>  
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> 
>  
> http://talebzadehmich.wordpress.com 
>  
>  
>  
> From: Ashok Kumar [mailto:ashok34...@yahoo.com.INVALID 
> ] 
> Sent: 16 February 2016 16:06
> To: User mailto:user@spark.apache.org>>
> Subject: Use case for RDD and Data Frame
>  
> Gurus,
>  
> What are the main differences between a Resilient Distributed Data (RDD) and 
> Data Frame (DF)
>  
> Where one can use RDD without transforming it to DF?
>  
> Regards and obliged



Re: New line lost in streaming output file

2016-02-16 Thread Chandeep Singh
!rdd.isEmpty() should work, but an alternative could be rdd.take(1).length != 0 

> On Feb 16, 2016, at 9:33 AM, Ashutosh Kumar  wrote:
> 
> I am getting multiple empty files for streaming output for each interval.
> To Avoid this I tried 
> 
>  kStream.foreachRDD(new VoidFunction2<JavaRDD<String>, Time>() {
>      public void call(JavaRDD<String> rdd, Time time) throws Exception {
>          if (!rdd.isEmpty()) {
>              rdd.saveAsTextFile("filename_" + time.milliseconds() + ".csv");
>          }
>      }
>  });
> 
> This prevents writing of empty files. However, the output appends the lines 
> one after another and the new lines are lost; all the lines end up merged. 
> How do I retain the new lines? 
> 
> Thanks



Re: Single context Spark from Python and Scala

2016-02-15 Thread Chandeep Singh
You could consider using Zeppelin - 
https://zeppelin.incubator.apache.org/docs/latest/interpreter/spark.html 


https://zeppelin.incubator.apache.org/ 

ZeppelinContext
Zeppelin automatically injects ZeppelinContext as variable 'z' in your 
scala/python environment. ZeppelinContext provides some additional functions 
and utility.



Object exchange

ZeppelinContext extends a map and is shared between the Scala and Python 
environments, so you can put an object from Scala and read it from Python, and 
vice versa.

Put object from scala

%spark
val myObject = ...
z.put("objName", myObject)
Get object from python

%python
myObject = z.get("objName")


> On Feb 15, 2016, at 12:10 PM, Leonid Blokhin  wrote:
> 
> Hello
> 
>  I want to work with a single Spark context from both Python and Scala. Is 
> it possible?
> 
> For example, is it possible to do this between an already started 
> ./bin/pyspark and ./bin/spark-shell?
> 
> 
> 
> Cheers,
> 
> Leonid
> 



Re: Need help :Does anybody has HDP cluster on EC2?

2016-02-15 Thread Chandeep Singh
You could also fire up a VNC session and access all internal pages from there.

> On Feb 15, 2016, at 9:19 AM, Divya Gehlot  wrote:
> 
> Hi Sabarish,
> Thanks alot for your help.
> I am able to view the logs now 
> 
> Thank you very much .
> 
> Cheers,
> Divya 
> 
> 
> On 15 February 2016 at 16:51, Sabarish Sasidharan 
> mailto:sabarish.sasidha...@manthan.com>> 
> wrote:
> You can setup SSH tunneling.
> 
> http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-ssh-tunnel.html
>  
> 
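> As a rough example (the key file, user and local port are assumptions), a 
> tunnel to the node manager UI could look like:
> 
> ssh -i mykey.pem -N -L 8042:ip-xxx-xx-xx-xxx.ap-southeast-1.compute.internal:8042 <user>@<master-public-dns>
> 
> after which http://localhost:8042 should reach the internal page.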
> 
> Regards
> Sab
> 
> On Mon, Feb 15, 2016 at 1:55 PM, Divya Gehlot  > wrote:
> Hi,
> I have hadoop cluster set up in EC2.
> I am unable to view application logs in the Web UI as it is using an 
> internal IP, like below:
> http://ip-xxx-xx-xx-xxx.ap-southeast-1.compute.internal:8042 
> 
> 
> How can I change this to external one or redirecting to external ?
> Attached screenshots for better understanding of my issue.
> 
> Would really appreciate help.
> 
> 
> Thanks,
> Divya 
> 
> 
> 
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org 
> 
> For additional commands, e-mail: user-h...@spark.apache.org 
> 
> 
> 
> 
> -- 
> 
> Architect - Big Data
> Ph: +91 99805 99458 
> 
> Manthan Systems | Company of the year - Analytics (2014 Frost and Sullivan 
> India ICT)
> +++
> 



Re: new to Spark - trying to get a basic example to run - could use some help

2016-02-13 Thread Chandeep Singh
Try looking at the stdout logs. I ran exactly the same job as you, did not see
anything on the console either, but found the result in stdout.

[csingh@<> ~]$ spark-submit --class org.apache.spark.examples.SparkPi \
    --master yarn --deploy-mode cluster --name RT_SparkPi \
    /opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/jars/spark-examples-1.5.0-cdh5.5.1-hadoop2.6.0-cdh5.5.1.jar 10

Log Type: stdout

Log Upload Time: Sat Feb 13 11:00:08 + 2016

Log Length: 23

Pi is roughly 3.140224
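
If the tracking URL is not reachable, the same stdout can usually be pulled 
with the YARN CLI once the application has finished (assuming log aggregation 
is enabled; substitute the application id from your submission output):

yarn logs -applicationId <application_id>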


Hope that helps!


On Sat, Feb 13, 2016 at 3:14 AM, Taylor, Ronald C 
wrote:

> Hello folks,
>
> This is my first msg to the list. New to Spark, and trying to run the
> SparkPi example shown in the Cloudera documentation.  We have Cloudera
> 5.5.1 running on a small cluster at our lab, with Spark 1.5.
>
> My trial invocation is given below. The output that I get **says** that I
> “SUCCEEDED” at the end. But – I don’t get any screen output on the value of
> pi. I also tried a SecondarySort Spark program  that I compiled and jarred
> from Dr. Parsian’s Data Algorithms book. That program  failed. So – I am
> focusing on getting SparkPi to work properly, to get started. Can somebody
> look at the screen output that I cut-and-pasted below and infer what I
> might be doing wrong?
>
> Am I forgetting to set one or more environment variables? Or not setting
> such properly?
>
> Here is the CLASSPATH value that I set:
>
>
> CLASSPATH=/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/jars/*:/people/rtaylor/SparkWork/DataAlgUtils
>
> Here is the settings of other environment variables:
>
> HADOOP_HOME=/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11
> SPARK_HOME=/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11
>
> HADOOP_CLASSPATH='/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/jars/*:$JAVA_HOME/lib/tools.jar'
>
> SPARK_CLASSPATH='/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/jars/*:$JAVA_HOME/lib/tools.jar:':'/people/rtaylor/SparkWork/DataAlgUtils'
>
> I am not sure that those env vars are properly set (or if even all of them
> are needed). But that’s what I’m currently using.
>
> As I said, the invocation below appears to terminate with final status set
> to “SUCCEEDED”. But – there is no screen output on the value of pi, which I
> understood would be shown. So – something appears to be going wrong. I went
> to the tracking URL given at the end, but could not access it.
>
> I would very much appreciate some guidance!
>
>
>- Ron Taylor
>
>
> %
>
> INVOCATION:
>
> [rtaylor@bigdatann]$ spark-submit --class org.apache.spark.examples.SparkPi \
>     --master yarn --deploy-mode cluster --name RT_SparkPi \
>     /opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/jars/spark-examples-1.5.0-cdh5.5.1-hadoop2.6.0-cdh5.5.1.jar 10
>
> SLF4J: Class path contains multiple SLF4J bindings.
> SLF4J: Found binding in
> [jar:file:/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/jars/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in
> [jar:file:/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/jars/slf4j-simple-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in
> [jar:file:/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/jars/livy-assembly.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in
> [jar:file:/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/jars/avro-tools-1.7.6-cdh5.5.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in
> [jar:file:/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/jars/pig-0.12.0-cdh5.5.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an
> explanation.
> SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
> 16/02/12 18:16:59 INFO client.RMProxy: Connecting to ResourceManager at
> bigdatann.ib/172.17.115.18:8032
> 16/02/12 18:16:59 INFO yarn.Client: Requesting a new application from
> cluster with 15 NodeManagers
> 16/02/12 18:16:59 INFO yarn.Client: Verifying our application has not
> requested more than the maximum memory capability of the cluster (65536 MB
> per container)
> 16/02/12 18:16:59 INFO yarn.Client: Will allocate AM container, with 1408
> MB memory including 384 MB overhead
> 16/02/12 18:16:59 INFO yarn.Client: Setting up container launch context
> for our AM
> 16/02/12 18:16:59 INFO yarn.Client: Setting up the launch environment for
> our AM container
> 16/02/12 18:16:59 INFO yarn.Client: Preparing resources for our AM
> container
> 16/02/12 18:17:00 WARN util.NativeCodeLoader: Unable to load native-hadoop
> library for your platform... using builtin-java classes where applicable
> 16/02/12 18:17:00 INFO yarn.Client: Uploading resource
> file:/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p0.11/jars/spark-assembly-1.5.0-cdh5.5.1-hadoop2.6.0-cdh5.5.1.jar
> ->
> hdfs://bigdatann.ib:8020/user/rtaylor/.sparkStaging/application_1454115464826_0070/spark-assembly-1.5.0-cdh5.5.1-hadoop2.6.0-cdh5.5.1.jar
> 16/02/12 18:17:2

Re: retrieving all the rows with collect()

2016-02-10 Thread Chandeep Singh
Hi Mich,

If you would like to print everything to the console you could use:
errlog.filter(line => line.contains("sed")).collect().foreach(println)

or you could always save to a file using any of the saveAs methods.
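
For example (the output path is just a placeholder):

errlog.filter(line => line.contains("sed")).saveAsTextFile("/tmp/sed_matches")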

Thanks,
Chandeep

On Wed, Feb 10, 2016 at 8:14 PM, <
mich.talebza...@cloudtechnologypartners.co.uk> wrote:

>
>
> Hi,
>
> I have a bunch of files stored in the hdfs /unix_files directory, 319 files 
> in total
> scala> val errlog = sc.textFile("/unix_files/*.ksh")
>
> scala>  errlog.filter(line => line.contains("sed"))count()
> res104: Long = 1113
> So it returns 1113 instances the word "sed"
>
> If I want to see the collection I can do
>
>
> *scala>  errlog.filter(line => line.contains("sed"))collect()*
>
> res105: Array[String] = Array(" DSQUERY=${1} ; 
> DBNAME=${2} ; ERROR=0 ; PROGNAME=$(basename $0 | sed -e s/.ksh//)", #. in 
> environment based on argument for script., "   exec sp_spaceused", "  
>   exec sp_spaceused", PROGNAME=$(basename $0 | sed -e s/.ksh//), "
> BACKUPSERVER=$5# Server that is used to load the transaction dump", " 
>BACKUPSERVER=$5 # Server that is used to load the transaction 
> dump", "BACKUPSERVER=$5 # Server that is used to load the 
> transaction dump", "cat $TMPDIR/${DBNAME}_trandump.sql | sed 
> s/${DSQUERY}/${REMOTESERVER}/ > $TMPDIR/${DBNAME}_trandump.tmpsql", cat 
> $TMPDIR/${DBNAME}_tran_transfer.sql | sed s/${DSQUERY}/${REMOTESERVER}/ > 
> $TMPDIR/${DBNAME}_tran_transfer.tmpsql, PROGNAME=$(basename $0 | sed -e 
> s/.ksh//), "B...
> scala>
>
>
> Now, is there any way I can retrieve all of these instances, or are they all 
> wrapped up so that I only see a few lines?
>
> Thanks,
>
> Mich
>
>