Re: Spark DataFrameNaFunctions unrecognized

2016-02-15 Thread satish chandra j
Hi Ted,

Please find the error below:

[ERROR] C:\workspace\etl\src\main\scala\stg_mds_pmds_rec_df.scala:116:
error: value na is not a member of org.apache.spark.sql.DataFrame

[ERROR]  var nw_cmpr_df=cmpr_df.na.fill("column1",)

[ERROR]

Please let me know if any further details are required.

Regards,
Satish Chandra


On Tue, Feb 16, 2016 at 1:03 PM, Ted Yu  wrote:

> bq. I am getting compile time error
>
> Do you mind pastebin'ning the error you got ?
>
> Cheers
>
> On Mon, Feb 15, 2016 at 11:08 PM, satish chandra j <
> jsatishchan...@gmail.com> wrote:
>
>> Hi Ted,
>> I understand it works fine if executed in the Spark shell.
>> Sorry, I missed mentioning that I am getting a compile-time error (using
>> Maven for the build).
>> I am executing my Spark job on a remote client by submitting the
>> executable jar file.
>>
>> Do I need to import any specific packages to make DataFrameNaFunctions
>> work?
>>
>> Please let me know if you have any inputs to fix the issue.
>>
>> Regards,
>> Satish Chandra
>>
>>
>>
>>
>>
>> On Mon, Feb 15, 2016 at 7:41 PM, Ted Yu  wrote:
>>
>>> fill() was introduced in 1.3.1
>>>
>>> Can you show code snippet which reproduces the error ?
>>>
>>> I tried the following using spark-shell on master branch:
>>>
>>> scala> df.na.fill(0)
>>> res0: org.apache.spark.sql.DataFrame = [col: int]
>>>
>>> Cheers
>>>
>>> On Mon, Feb 15, 2016 at 3:36 AM, satish chandra j <
>>> jsatishchan...@gmail.com> wrote:
>>>
 Hi All,
 Currently I am using Spark version 1.4.0 and getting an error when trying to
 use the "fill" function, which is one of the DataFrameNaFunctions.

 Snippet:
 df.na.fill(col: )

 Error:
 value na is not a member of org.apache.spark.sql.DataFrame

 I need the null values in column "col" of DataFrame "df" to be replaced
 with the value given in the above snippet.

 I understand the code does not require any additional packages to support
 DataFrameNaFunctions.

 Please let me know if I am missing anything, so that I can get these
 DataFrameNaFunctions working.

 Regards,
 Satish Chandra J

>>>
>>>
>>
>
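For reference, a minimal sketch of how DataFrameNaFunctions.fill is typically called on a single column; the DataFrame, column name and replacement value below are illustrative, not taken from the job above:

// assuming cmprDf is an existing DataFrame with a string column "column1"
val nwCmprDf = cmprDf.na.fill(Map("column1" -> "UNKNOWN"))

If na still does not resolve at compile time, it is worth checking that the spark-sql artifact on the Maven compile classpath is 1.3.1 or later, since fill() and df.na were only introduced in 1.3.1.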


Re: Spark DataFrameNaFunctions unrecognized

2016-02-15 Thread Ted Yu
bq. I am getting compile time error

Do you mind pastebin'ning the error you got ?

Cheers

On Mon, Feb 15, 2016 at 11:08 PM, satish chandra j  wrote:

> Hi Ted,
> I understand it works fine if executed in the Spark shell.
> Sorry, I missed mentioning that I am getting a compile-time error (using
> Maven for the build).
> I am executing my Spark job on a remote client by submitting the
> executable jar file.
>
> Do I need to import any specific packages to make DataFrameNaFunctions
> work?
>
> Please let me know if you have any inputs to fix the issue.
>
> Regards,
> Satish Chandra
>
>
>
>
>
> On Mon, Feb 15, 2016 at 7:41 PM, Ted Yu  wrote:
>
>> fill() was introduced in 1.3.1
>>
>> Can you show code snippet which reproduces the error ?
>>
>> I tried the following using spark-shell on master branch:
>>
>> scala> df.na.fill(0)
>> res0: org.apache.spark.sql.DataFrame = [col: int]
>>
>> Cheers
>>
>> On Mon, Feb 15, 2016 at 3:36 AM, satish chandra j <
>> jsatishchan...@gmail.com> wrote:
>>
>>> Hi All,
>>> Currently I am using Spark version 1.4.0 and getting an error when trying to
>>> use the "fill" function, which is one of the DataFrameNaFunctions.
>>>
>>> Snippet:
>>> df.na.fill(col: )
>>>
>>> Error:
>>> value na is not a member of org.apache.spark.sql.DataFrame
>>>
>>> I need the null values in column "col" of DataFrame "df" to be replaced
>>> with the value given in the above snippet.
>>>
>>> I understand the code does not require any additional packages to support
>>> DataFrameNaFunctions.
>>>
>>> Please let me know if I am missing anything, so that I can get these
>>> DataFrameNaFunctions working.
>>>
>>> Regards,
>>> Satish Chandra J
>>>
>>
>>
>


Saving Kafka Offsets to Cassandra at beginning of each batch in Spark Streaming

2016-02-15 Thread Abhishek Anand
I have a Kafka RDD and I need to save the offsets to a Cassandra table at the
beginning of each batch.

Basically, I need to write the offsets (of the OffsetRange type shown below) that
I am getting inside foreachRDD to Cassandra. The javaFunctions API used to write
to Cassandra needs an RDD. How can I create an RDD from the offsets and write it
to the Cassandra table?


public static void writeOffsets(JavaPairDStream kafkastream) {
    kafkastream.foreachRDD((rdd, batchMilliSec) -> {
        OffsetRange[] offsets = ((HasOffsetRanges) rdd.rdd()).offsetRanges();
        return null;
    });
}


Thanks !!
Abhi
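For illustration, a sketch in Scala of one way to do this (the Java connector API is analogous). It assumes the DataStax spark-cassandra-connector is on the classpath, a direct Kafka stream called kafkaStream, and a table my_keyspace.kafka_offsets(topic text, partition int, off bigint); all of those names are made up:

import com.datastax.spark.connector._
import org.apache.spark.streaming.kafka.{HasOffsetRanges, OffsetRange}

kafkaStream.foreachRDD { rdd =>
  val offsets: Array[OffsetRange] = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  // parallelize the small offsets array into an RDD and save it with the connector
  rdd.sparkContext
     .parallelize(offsets.map(o => (o.topic, o.partition, o.fromOffset)).toSeq)
     .saveToCassandra("my_keyspace", "kafka_offsets", SomeColumns("topic", "partition", "off"))
}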


Re: Creating HiveContext in Spark-Shell fails

2016-02-15 Thread Gavin Yue
This sqlContext is an instance of HiveContext; don't be confused by the
name.



> On Feb 16, 2016, at 12:51, Prabhu Joseph  wrote:
> 
> Hi All,
> 
> On creating HiveContext in spark-shell, fails with 
> 
> Caused by: ERROR XSDB6: Another instance of Derby may have already booted the 
> database /SPARK/metastore_db.
> 
> Spark-Shell already has created metastore_db for SqlContext. 
> 
> Spark context available as sc.
> SQL context available as sqlContext.
> 
> But without HiveContext, I am able to query the data using SQLContext.
> 
> scala>  var df = 
> sqlContext.read.format("com.databricks.spark.csv").option("header", 
> "true").option("inferSchema", "true").load("/SPARK/abc")
> df: org.apache.spark.sql.DataFrame = [Prabhu: string, Joseph: string]
> 
> So is there any real need for a HiveContext inside the Spark shell? Is everything 
> that can be done with a HiveContext achievable with the SQLContext inside the 
> Spark shell?
> 
> 
> 
> Thanks,
> Prabhu Joseph
> 
> 
> 
> 




Re: Spark DataFrameNaFunctions unrecognized

2016-02-15 Thread satish chandra j
Hi Ted,
I understand it works fine if executed in the Spark shell.
Sorry, I missed mentioning that I am getting a compile-time error (using
Maven for the build).
I am executing my Spark job on a remote client by submitting the executable jar file.

Do I need to import any specific packages to make DataFrameNaFunctions
work?

Please let me know if you have any inputs to fix the issue.

Regards,
Satish Chandra





On Mon, Feb 15, 2016 at 7:41 PM, Ted Yu  wrote:

> fill() was introduced in 1.3.1
>
> Can you show code snippet which reproduces the error ?
>
> I tried the following using spark-shell on master branch:
>
> scala> df.na.fill(0)
> res0: org.apache.spark.sql.DataFrame = [col: int]
>
> Cheers
>
> On Mon, Feb 15, 2016 at 3:36 AM, satish chandra j <
> jsatishchan...@gmail.com> wrote:
>
>> Hi All,
>> Currently I am using Spark version 1.4.0 and getting an error when trying to
>> use the "fill" function, which is one of the DataFrameNaFunctions.
>>
>> Snippet:
>> df.na.fill(col: )
>>
>> Error:
>> value na is not a member of org.apache.spark.sql.DataFrame
>>
>> I need the null values in column "col" of DataFrame "df" to be replaced
>> with the value given in the above snippet.
>>
>> I understand the code does not require any additional packages to support
>> DataFrameNaFunctions.
>>
>> Please let me know if I am missing anything, so that I can get these
>> DataFrameNaFunctions working.
>>
>> Regards,
>> Satish Chandra J
>>
>
>


Error when doing a SaveAstable on a Spark dataframe

2016-02-15 Thread SRK
Hi,

I get an error when I do a SaveAsTable as shown below. I do have write
access to the hive volume. Any idea as to why this is happening?

 val df = testDF.toDF("id", "rec")

 df.printSchema()

 val options = Map("path" -> "/hive/test.db/")

 df.write.format("parquet").partitionBy("id").options(options)
   .mode(SaveMode.Append).saveAsTable("sessRecs")

16/02/15 19:04:41 WARN scheduler.TaskSetManager: Lost task 369.0 in stage
2.0 (): org.apache.spark.SparkException: Task failed while writing rows.

at
org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer.writeRows(WriterContainer.scala:393)

at
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)

at
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)

at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)

at org.apache.spark.scheduler.Task.run(Task.scala:88)

at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)

at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)

at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)

at java.lang.Thread.run(Thread.java:745)

Caused by: java.lang.RuntimeException: Failed to commit task

at
org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer.commitTask$2(WriterContainer.scala:422)

at
org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer.writeRows(WriterContainer.scala:388)

... 8 more

Caused by: java.io.IOException: Error: Read-only file system(30), file:
test, user name: test, ID: 12345678





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Error-when-doing-a-SaveAstable-on-a-Spark-dataframe-tp26232.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: Spark on Windows

2016-02-15 Thread UMESH CHAUDHARY
You can check "spark.master" property in conf/spark-defaults.conf and try
to give IP of the VM in place of "localhost".

On Tue, Feb 16, 2016 at 7:48 AM, KhajaAsmath Mohammed <
mdkhajaasm...@gmail.com> wrote:

> Hi,
>
> I am new to Spark and started working on it by writing small programs. I
> am able to run those in the Cloudera QuickStart VM but not able to run them
> in Eclipse when giving the master URL.
>
> *Steps I performed:*
>
> Started the master and can access it through http://localhost:8080
>
> Started a worker and can access it.
>
> Ran the word count giving the master as spark://localhost:7077, but there is
> no output and I can't see the application ID in the master web UI either.
>
> I tried with the master as local and was able to run successfully. I want to
> run on the master so that I can view logs in the master and worker. Any
> suggestions for this?
>
> Thanks,
> Asmath
>
>
>


Re: Side effects of using var inside a class object in a Rdd

2016-02-15 Thread Ted Yu
Age can be computed from the birthdate.
Looks like it doesn't need to be a member of Animal class.

If age is just for illustration, can you give an example which better
mimics the scenario you work on ?

Cheers

On Mon, Feb 15, 2016 at 8:53 PM, Hemalatha A <
hemalatha.amru...@googlemail.com> wrote:

> Hello,
>
> I want to know what the cons and performance impacts are of using a var
> inside a class object in an RDD.
>
>
> Here is an example:
>
> Animal is a huge class with a large number of val-type variables (approx >600
> variables), but frequently we will have to update Age (just 1 variable)
> after some computation. What is the best way to do it?
>
> class Animal(age: Int, name: String) {
>   var animalAge: Int = age
>   val animalName: String = name
>   val ..
> }
>
>
> val animalRdd = sc.parallelize(List(Animal(1,"XYZ"), Animal(2,"ABC") ))
> ...
> ...
> animalRdd.map(ani=>{
>  if(ani.yearChange()) ani.animalAge+=1
>  ani
> })
>
>
> Is it advisable to use a var in this case? Or can I do ani.copy(animalAge=2),
> which will reallocate the memory altogether for the animal? Please advise
> on the best way to handle such cases.
>
>
>
> Regards
> Hemalatha
>


Re: Creating HiveContext in Spark-Shell fails

2016-02-15 Thread Prabhu Joseph
Thanks Mark, that answers my question.

On Tue, Feb 16, 2016 at 10:55 AM, Mark Hamstra 
wrote:

> Welcome to
>
>     __
>
>  / __/__  ___ _/ /__
>
> _\ \/ _ \/ _ `/ __/  '_/
>
>/___/ .__/\_,_/_/ /_/\_\   version 2.0.0-SNAPSHOT
>
>   /_/
>
>
>
> Using Scala version 2.11.7 (Java HotSpot(TM) 64-Bit Server VM, Java
> 1.8.0_72)
>
> Type in expressions to have them evaluated.
>
> Type :help for more information.
>
>
> scala> sqlContext.isInstanceOf[org.apache.spark.sql.hive.HiveContext]
>
> res0: Boolean = true
>
>
>
> On Mon, Feb 15, 2016 at 8:51 PM, Prabhu Joseph  > wrote:
>
>> Hi All,
>>
>> On creating HiveContext in spark-shell, fails with
>>
>> Caused by: ERROR XSDB6: Another instance of Derby may have already booted
>> the database /SPARK/metastore_db.
>>
>> Spark-Shell already has created metastore_db for SqlContext.
>>
>> Spark context available as sc.
>> SQL context available as sqlContext.
>>
>> But without HiveContext, I am able to query the data using SQLContext.
>>
>> scala>  var df =
>> sqlContext.read.format("com.databricks.spark.csv").option("header",
>> "true").option("inferSchema", "true").load("/SPARK/abc")
>> df: org.apache.spark.sql.DataFrame = [Prabhu: string, Joseph: string]
>>
>> So is there any real need for a HiveContext inside the Spark shell? Is
>> everything that can be done with a HiveContext achievable with the SQLContext
>> inside the Spark shell?
>>
>>
>>
>> Thanks,
>> Prabhu Joseph
>>
>>
>>
>>
>>
>


Re: Creating HiveContext in Spark-Shell fails

2016-02-15 Thread Mark Hamstra
Welcome to

    __

 / __/__  ___ _/ /__

_\ \/ _ \/ _ `/ __/  '_/

   /___/ .__/\_,_/_/ /_/\_\   version 2.0.0-SNAPSHOT

  /_/



Using Scala version 2.11.7 (Java HotSpot(TM) 64-Bit Server VM, Java
1.8.0_72)

Type in expressions to have them evaluated.

Type :help for more information.


scala> sqlContext.isInstanceOf[org.apache.spark.sql.hive.HiveContext]

res0: Boolean = true



On Mon, Feb 15, 2016 at 8:51 PM, Prabhu Joseph 
wrote:

> Hi All,
>
> On creating HiveContext in spark-shell, fails with
>
> Caused by: ERROR XSDB6: Another instance of Derby may have already booted
> the database /SPARK/metastore_db.
>
> Spark-Shell already has created metastore_db for SqlContext.
>
> Spark context available as sc.
> SQL context available as sqlContext.
>
> But without HiveContext, I am able to query the data using SQLContext.
>
> scala>  var df =
> sqlContext.read.format("com.databricks.spark.csv").option("header",
> "true").option("inferSchema", "true").load("/SPARK/abc")
> df: org.apache.spark.sql.DataFrame = [Prabhu: string, Joseph: string]
>
> So is there any real need for a HiveContext inside the Spark shell? Is
> everything that can be done with a HiveContext achievable with the SQLContext
> inside the Spark shell?
>
>
>
> Thanks,
> Prabhu Joseph
>
>
>
>
>


Side effects of using var inside a class object in a Rdd

2016-02-15 Thread Hemalatha A
Hello,

I want to know what the cons and performance impacts are of using a var
inside a class object in an RDD.


Here is an example:

Animal is a huge class with a large number of val-type variables (approx >600
variables), but frequently we will have to update Age (just 1 variable)
after some computation. What is the best way to do it?

class Animal(age: Int, name: String) {
  var animalAge: Int = age
  val animalName: String = name
  val ..
}


val animalRdd = sc.parallelize(List(Animal(1,"XYZ"), Animal(2,"ABC") ))
...
...
animalRdd.map(ani=>{
 if(ani.yearChange()) ani.animalAge+=1
 ani
})


Is it advisable to use a var in this case? Or can I do ani.copy(animalAge=2),
which will reallocate the memory altogether for the animal? Please advise
on the best way to handle such cases.



Regards
Hemalatha
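For illustration, a sketch of the copy-based alternative; it assumes Animal can be turned into a case class and that yearChange() exists as in the snippet above:

case class Animal(animalAge: Int, animalName: String /* , ...the other vals */) {
  def yearChange(): Boolean = true   // placeholder for the real check
}

val updatedRdd = animalRdd.map { ani =>
  // copy() allocates a new Animal, but the unchanged fields are only
  // reference-copied, so the other ~600 vals are not deep-copied
  if (ani.yearChange()) ani.copy(animalAge = ani.animalAge + 1) else ani
}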


Creating HiveContext in Spark-Shell fails

2016-02-15 Thread Prabhu Joseph
Hi All,

On creating HiveContext in spark-shell, fails with

Caused by: ERROR XSDB6: Another instance of Derby may have already booted
the database /SPARK/metastore_db.

Spark-Shell already has created metastore_db for SqlContext.

Spark context available as sc.
SQL context available as sqlContext.

But without HiveContext, I am able to query the data using SQLContext.

scala>  var df =
sqlContext.read.format("com.databricks.spark.csv").option("header",
"true").option("inferSchema", "true").load("/SPARK/abc")
df: org.apache.spark.sql.DataFrame = [Prabhu: string, Joseph: string]

So is there any real need for a HiveContext inside the Spark shell? Is everything
that can be done with a HiveContext achievable with the SQLContext inside the
Spark shell?



Thanks,
Prabhu Joseph


Re: recommendations with duplicate ratings

2016-02-15 Thread Nick Pentreath
Yes, for implicit data you need to sum up the "ratings" (actually view them
as "weights") for each user-item pair. I do this is my ALS application.

For ecommerce, say a "view" event has a weight of 1.0 and a "purchase" a
weight of 3.0. Then adding multiple events together for a given user and
item makes sense.

ALS assumes an input ratings matrix (even though Spark's implementation
takes an RDD[Rating]), so the algorithm itself doesn't support duplicate
ratings.
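A minimal sketch of that aggregation step before calling trainImplicit, assuming ratings is the RDD[Rating] being fed to ALS; the rank, iteration count, lambda and alpha below are illustrative:

import org.apache.spark.mllib.recommendation.{ALS, Rating}

// sum the implicit weights for each (user, product) pair before training
val aggregated = ratings
  .map(r => ((r.user, r.product), r.rating))
  .reduceByKey(_ + _)
  .map { case ((user, product), weight) => Rating(user, product, weight) }

// arguments: ratings, rank, iterations, lambda, alpha
val model = ALS.trainImplicit(aggregated, 10, 10, 0.01, 1.0)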

On Mon, 15 Feb 2016 at 23:24, Sean Owen  wrote:

> You're asking what happens when you put many ratings for one user-item
> pair in the input, right? I'm saying you shouldn't do that --
> aggregate them into one pair in your application.
>
> For rating-like (explicit) data, it doesn't really make sense
> otherwise. The only sensible aggregation is last-first, but there's no
> natural notion of 'last' in the RDD you supply.
>
> For count-like (implicit) data, it makes sense to sum the inputs, but
> I don't think that is done automatically. I skimmed the code and
> didn't see it. So you would sum the values per user-item anyway.
>
> On Mon, Feb 15, 2016 at 9:05 PM, Roberto Pagliari
>  wrote:
> > Hi Sean,
> > I'm not sure what you mean by aggregate. The input of trainImplicit is an
> > RDD of Ratings.
> >
> > I find it odd that duplicate ratings would mess with ALS in the implicit
> > case. It'd be nice if it didn't.
> >
> >
> > Thank you,
> >
> > On 15/02/2016 20:49, "Sean Owen"  wrote:
> >
> >>I believe you need to aggregate inputs per user-item in your call. I
> >>am actually not sure what happens if you don't. I think it would
> >>compute the factors twice and one would win, so yes I think it would
> >>effectively be ignored.  For implicit, that wouldn't work correctly,
> >>so you do need to aggregate.
> >>
> >>On Mon, Feb 15, 2016 at 8:30 PM, Roberto Pagliari
> >> wrote:
> >>> What happens when duplicate user/ratings are fed into ALS (the implicit
> >>> version, specifically)? Are duplicates ignored?
> >>>
> >>> I'm asking because that would save me a distinct.
> >>>
> >>>
> >>>
> >>> Thank you,
> >>>
> >
>
>
>


Re: SparkSQL/DataFrame - Is `JOIN USING` syntax null-safe?

2016-02-15 Thread Zhong Wang
Just checked the code and wrote some tests. Seems it is not null-safe...

Shall we consider providing a null-safe option for `JOIN USING` syntax?

Zhong

On Mon, Feb 15, 2016 at 7:25 PM, Zhong Wang  wrote:

> Is it null-safe when we use this interface?
> --
>
> def join(right: DataFrame, usingColumns: Seq[String], joinType: String): 
> DataFrame
>
>
> Thanks,
>
> Zhong
>
>
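In the meantime, a null-safe equi-join can be written explicitly with the <=> column operator instead of the usingColumns form; a minimal sketch for two DataFrames df1 and df2, where the column name is illustrative and both key columns are kept in the result:

// rows where both sides have a null "key" will match each other
val joined = df1.join(df2, df1("key") <=> df2("key"), "inner")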


Re: New line lost in streaming output file

2016-02-15 Thread Ashutosh Kumar
Requesting some pointers on this.
Thanks


On Mon, Feb 15, 2016 at 3:39 PM, Ashutosh Kumar 
wrote:

> I am getting multiple empty files in the streaming output for each interval.
> To avoid this I tried:
>
>  kStream.foreachRDD(new VoidFunction2() {
>      public void call(JavaRDD rdd, Time time) throws Exception {
>          if (!rdd.isEmpty()) {
>              rdd.saveAsTextFile("filename_" + time.milliseconds() + ".csv");
>          }
>      }
>  });
>
> This prevents the empty files from being written. However, it appends the lines
> one after another with the newlines removed, so all lines end up merged.
> How do I retain my newlines?
>
> Thanks
> Ashutosh
>


Getting java.lang.IllegalArgumentException: requirement failed while calling Sparks MLLIB StreamingKMeans from java application

2016-02-15 Thread Yogesh Vyas
Hi,
 I am trying to run StreamingKMeans from a Java application, but
it gives the following error:

"Getting java.lang.IllegalArgumentException: requirement failed while
calling Sparks MLLIB StreamingKMeans from java application"

Below is my code:

JavaDStream v = trainingData.map(new Function() {

public Vector call(String arg0) throws Exception {
// TODO Auto-generated method stub
String[] p = arg0.split(",");
double[] d = new double[p.length] ;
for(int i=0;i

Re: IllegalArgumentException UnsatisfiedLinkError snappy-1.1.2 spark-shell error

2016-02-15 Thread Paolo Villaflores
Yes, I have seen that. But java.io.tmpdir has a default definition on
Linux -- it is /tmp.



On Tue, Feb 16, 2016 at 2:17 PM, Ted Yu  wrote:

> Have you seen this thread ?
>
>
> http://search-hadoop.com/m/q3RTtW43zT1e2nfb=Re+ibsnappyjava+so+failed+to+map+segment+from+shared+object
>
> On Mon, Feb 15, 2016 at 7:09 PM, Paolo Villaflores <
> pbvillaflo...@gmail.com> wrote:
>
>>
>> Hi,
>>
>>
>>
>> I am trying to run spark 1.6.0.
>>
>> I have previously just installed a fresh instance of hadoop 2.6.0 and
>> hive 0.14.
>>
>> Hadoop, mapreduce, hive and beeline are working.
>>
>> However, as soon as I run `sc.textfile()` within spark-shell, it returns
>> an error:
>>
>>
>> $ spark-shell
>> Welcome to
>>     __
>>  / __/__  ___ _/ /__
>> _\ \/ _ \/ _ `/ __/  '_/
>>/___/ .__/\_,_/_/ /_/\_\   version 1.6.0
>>   /_/
>>
>> Using Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java
>> 1.7.0_67)
>> Type in expressions to have them evaluated.
>> Type :help for more information.
>> Spark context available as sc.
>> SQL context available as sqlContext.
>>
>> scala> val textFile = sc.textFile("README.md")
>> java.lang.IllegalArgumentException: java.lang.UnsatisfiedLinkError:
>> /tmp/snappy-1.1.2-2ccaf764-c7c4-4ff1-a68e-bbfdec0a3aa1-libsnappyjava.so:
>> /tmp/snappy-1.1.2-2ccaf764-c7c4-4ff1-a68e-bbfdec0a3aa1-libsnappyjava.so:
>> failed to map segment from shared object: Operation not permitted
>> at
>> org.apache.spark.io.SnappyCompressionCodec.(CompressionCodec.scala:156)
>> at
>> sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>> at
>> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>> at
>> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>> at
>> java.lang.reflect.Constructor.newInstance(Constructor.java:526)
>> at
>> org.apache.spark.io.CompressionCodec$.createCodec(CompressionCodec.scala:72)
>> at
>> org.apache.spark.io.CompressionCodec$.createCodec(CompressionCodec.scala:65)
>> at org.apache.spark.broadcast.TorrentBroadcast.org
>> $apache$spark$broadcast$TorrentBroadcast$$setConf(TorrentBroadcast.scala:73)
>> at
>> org.apache.spark.broadcast.TorrentBroadcast.(TorrentBroadcast.scala:80)
>> at
>> org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34)
>> at
>> org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:63)
>> at
>> org.apache.spark.SparkContext.broadcast(SparkContext.scala:1326)
>> at
>> org.apache.spark.SparkContext$$anonfun$hadoopFile$1.apply(SparkContext.scala:1014)
>> at
>> org.apache.spark.SparkContext$$anonfun$hadoopFile$1.apply(SparkContext.scala:1011)
>> at
>> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
>> at
>> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
>> at
>> org.apache.spark.SparkContext.withScope(SparkContext.scala:714)
>> at
>> org.apache.spark.SparkContext.hadoopFile(SparkContext.scala:1011)
>> at
>> org.apache.spark.SparkContext$$anonfun$textFile$1.apply(SparkContext.scala:832)
>> at
>> org.apache.spark.SparkContext$$anonfun$textFile$1.apply(SparkContext.scala:830)
>> at
>> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
>> at
>> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
>> at
>> org.apache.spark.SparkContext.withScope(SparkContext.scala:714)
>> at
>> org.apache.spark.SparkContext.textFile(SparkContext.scala:830)
>> at
>> $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:27)
>> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:32)
>> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:34)
>> at $iwC$$iwC$$iwC$$iwC$$iwC.(:36)
>> at $iwC$$iwC$$iwC$$iwC.(:38)
>> at $iwC$$iwC$$iwC.(:40)
>> at $iwC$$iwC.(:42)
>> at $iwC.(:44)
>> at (:46)
>> at .(:50)
>> at .()
>> at .(:7)
>> at .()
>> at $print()
>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>> at
>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>> at
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>> at java.lang.reflect.Method.invoke(Method.java:606)
>> at
>> org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
>> at
>> org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1346)
>> 

SparkSQL/DataFrame - Is `JOIN USING` syntax null-safe?

2016-02-15 Thread Zhong Wang
Is it null-safe when we use this interface?
--

def join(right: DataFrame, usingColumns: Seq[String], joinType:
String): DataFrame


Thanks,

Zhong


Re: IllegalArgumentException UnsatisfiedLinkError snappy-1.1.2 spark-shell error

2016-02-15 Thread Ted Yu
Have you seen this thread ?

http://search-hadoop.com/m/q3RTtW43zT1e2nfb=Re+ibsnappyjava+so+failed+to+map+segment+from+shared+object

On Mon, Feb 15, 2016 at 7:09 PM, Paolo Villaflores 
wrote:

>
> Hi,
>
>
>
> I am trying to run spark 1.6.0.
>
> I have previously just installed a fresh instance of hadoop 2.6.0 and hive
> 0.14.
>
> Hadoop, mapreduce, hive and beeline are working.
>
> However, as soon as I run `sc.textfile()` within spark-shell, it returns
> an error:
>
>
> $ spark-shell
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 1.6.0
>   /_/
>
> Using Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java
> 1.7.0_67)
> Type in expressions to have them evaluated.
> Type :help for more information.
> Spark context available as sc.
> SQL context available as sqlContext.
>
> scala> val textFile = sc.textFile("README.md")
> java.lang.IllegalArgumentException: java.lang.UnsatisfiedLinkError:
> /tmp/snappy-1.1.2-2ccaf764-c7c4-4ff1-a68e-bbfdec0a3aa1-libsnappyjava.so:
> /tmp/snappy-1.1.2-2ccaf764-c7c4-4ff1-a68e-bbfdec0a3aa1-libsnappyjava.so:
> failed to map segment from shared object: Operation not permitted
> at
> org.apache.spark.io.SnappyCompressionCodec.(CompressionCodec.scala:156)
> at
> sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> at
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
> at
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at
> java.lang.reflect.Constructor.newInstance(Constructor.java:526)
> at
> org.apache.spark.io.CompressionCodec$.createCodec(CompressionCodec.scala:72)
> at
> org.apache.spark.io.CompressionCodec$.createCodec(CompressionCodec.scala:65)
> at org.apache.spark.broadcast.TorrentBroadcast.org
> $apache$spark$broadcast$TorrentBroadcast$$setConf(TorrentBroadcast.scala:73)
> at
> org.apache.spark.broadcast.TorrentBroadcast.(TorrentBroadcast.scala:80)
> at
> org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34)
> at
> org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:63)
> at
> org.apache.spark.SparkContext.broadcast(SparkContext.scala:1326)
> at
> org.apache.spark.SparkContext$$anonfun$hadoopFile$1.apply(SparkContext.scala:1014)
> at
> org.apache.spark.SparkContext$$anonfun$hadoopFile$1.apply(SparkContext.scala:1011)
> at
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
> at
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
> at
> org.apache.spark.SparkContext.withScope(SparkContext.scala:714)
> at
> org.apache.spark.SparkContext.hadoopFile(SparkContext.scala:1011)
> at
> org.apache.spark.SparkContext$$anonfun$textFile$1.apply(SparkContext.scala:832)
> at
> org.apache.spark.SparkContext$$anonfun$textFile$1.apply(SparkContext.scala:830)
> at
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
> at
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
> at
> org.apache.spark.SparkContext.withScope(SparkContext.scala:714)
> at
> org.apache.spark.SparkContext.textFile(SparkContext.scala:830)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:27)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:32)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:34)
> at $iwC$$iwC$$iwC$$iwC$$iwC.(:36)
> at $iwC$$iwC$$iwC$$iwC.(:38)
> at $iwC$$iwC$$iwC.(:40)
> at $iwC$$iwC.(:42)
> at $iwC.(:44)
> at (:46)
> at .(:50)
> at .()
> at .(:7)
> at .()
> at $print()
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at
> org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
> at
> org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1346)
> at
> org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
> at
> org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
> at
> org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
> at
> 

Re: which is better RDD or Dataframe?

2016-02-15 Thread Ted Yu
Can you describe the types of query you want to perform ?

If you don't already have a data flow which is optimized for RDDs, I would
suggest using the DataFrame API (or even the Dataset API), which gives the
optimizer more room.

Cheers

On Mon, Feb 15, 2016 at 6:43 PM, Divya Gehlot 
wrote:

> Hi,
> I would like to know which gives better performance, RDDs or DataFrames?
> Like for one scenario:
> 1. Read the file as an RDD, register it as a temp table and fire a SQL query.
>
> 2. Read the file through the DataFrame API, or convert the RDD to a DataFrame,
> and use DataFrame APIs to process the data.
>
> For a scenario like the above, which gives better performance?
> Does anybody have benchmarks or statistical data regarding that?
>
>
> Thanks,
> Divya
>
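For reference, a sketch of the two approaches being compared; the file path, schema and query are illustrative, and both variants go through the same Catalyst optimizer once a DataFrame is involved:

import sqlContext.implicits._

case class Rec(id: Int, name: String)
val rdd = sc.textFile("/data/input.csv")
            .map(_.split(","))
            .map(a => Rec(a(0).toInt, a(1)))

// (1) register the data as a temp table and fire a SQL query
rdd.toDF().registerTempTable("recs")
val viaSql = sqlContext.sql("SELECT name, COUNT(*) AS cnt FROM recs GROUP BY name")

// (2) use the DataFrame API directly
val viaApi = rdd.toDF().groupBy("name").count()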


IllegalArgumentException UnsatisfiedLinkError snappy-1.1.2 spark-shell error

2016-02-15 Thread Paolo Villaflores
Hi,



I am trying to run spark 1.6.0.

I have previously just installed a fresh instance of hadoop 2.6.0 and hive
0.14.

Hadoop, mapreduce, hive and beeline are working.

However, as soon as I run `sc.textfile()` within spark-shell, it returns an
error:


$ spark-shell
Welcome to
    __
 / __/__  ___ _/ /__
_\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.6.0
  /_/

Using Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java
1.7.0_67)
Type in expressions to have them evaluated.
Type :help for more information.
Spark context available as sc.
SQL context available as sqlContext.

scala> val textFile = sc.textFile("README.md")
java.lang.IllegalArgumentException: java.lang.UnsatisfiedLinkError:
/tmp/snappy-1.1.2-2ccaf764-c7c4-4ff1-a68e-bbfdec0a3aa1-libsnappyjava.so:
/tmp/snappy-1.1.2-2ccaf764-c7c4-4ff1-a68e-bbfdec0a3aa1-libsnappyjava.so:
failed to map segment from shared object: Operation not permitted
at
org.apache.spark.io.SnappyCompressionCodec.(CompressionCodec.scala:156)
at
sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at
java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at
org.apache.spark.io.CompressionCodec$.createCodec(CompressionCodec.scala:72)
at
org.apache.spark.io.CompressionCodec$.createCodec(CompressionCodec.scala:65)
at org.apache.spark.broadcast.TorrentBroadcast.org
$apache$spark$broadcast$TorrentBroadcast$$setConf(TorrentBroadcast.scala:73)
at
org.apache.spark.broadcast.TorrentBroadcast.(TorrentBroadcast.scala:80)
at
org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34)
at
org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:63)
at
org.apache.spark.SparkContext.broadcast(SparkContext.scala:1326)
at
org.apache.spark.SparkContext$$anonfun$hadoopFile$1.apply(SparkContext.scala:1014)
at
org.apache.spark.SparkContext$$anonfun$hadoopFile$1.apply(SparkContext.scala:1011)
at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
at
org.apache.spark.SparkContext.withScope(SparkContext.scala:714)
at
org.apache.spark.SparkContext.hadoopFile(SparkContext.scala:1011)
at
org.apache.spark.SparkContext$$anonfun$textFile$1.apply(SparkContext.scala:832)
at
org.apache.spark.SparkContext$$anonfun$textFile$1.apply(SparkContext.scala:830)
at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
at
org.apache.spark.SparkContext.withScope(SparkContext.scala:714)
at
org.apache.spark.SparkContext.textFile(SparkContext.scala:830)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:27)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:32)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:34)
at $iwC$$iwC$$iwC$$iwC$$iwC.(:36)
at $iwC$$iwC$$iwC$$iwC.(:38)
at $iwC$$iwC$$iwC.(:40)
at $iwC$$iwC.(:42)
at $iwC.(:44)
at (:46)
at .(:50)
at .()
at .(:7)
at .()
at $print()
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at
org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
at
org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1346)
at
org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
at
org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
at
org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
at
org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857)
at
org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
at
org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814)
at
org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:657)
at
org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:665)
at org.apache.spark.repl.SparkILoop.org

which is better RDD or Dataframe?

2016-02-15 Thread Divya Gehlot
Hi,
I would like to know which gives better performance, RDDs or DataFrames?
Like for one scenario:
1. Read the file as an RDD, register it as a temp table and fire a SQL query.

2. Read the file through the DataFrame API, or convert the RDD to a DataFrame,
and use DataFrame APIs to process the data.

For a scenario like the above, which gives better performance?
Does anybody have benchmarks or statistical data regarding that?


Thanks,
Divya


Spark on Windows

2016-02-15 Thread KhajaAsmath Mohammed
Hi,

I am new to Spark and started working on it by writing small programs. I
am able to run those in the Cloudera QuickStart VM but not able to run them in
Eclipse when giving the master URL.

*Steps I performed:*

Started the master and can access it through http://localhost:8080

Started a worker and can access it.

Ran the word count giving the master as spark://localhost:7077, but there is no
output and I can't see the application ID in the master web UI either.

I tried with the master as local and was able to run successfully. I want to
run on the master so that I can view logs in the master and worker. Any
suggestions for this?

Thanks,
Asmath


Re: How to run Scala file examples in spark 1.5.2

2016-02-15 Thread Ted Yu
bq. 150.142.11

The address above seems to be missing one octet.

bq. org.apache.spark.examples/HdfsTest

The slash should be a dot.

Cheers



On Mon, Feb 15, 2016 at 5:53 PM, Ashok Kumar  wrote:

> Thank you sir it is spark-examples-1.5.2-hadoop2.6.0.jar in mine
>
> Can you please tell me how to run with spark-submit correctly as I did
>
> spark-shell --master spark:///150.142.11:7077 --class
> org.apache.spark.examples/HdfsTest
> $SPARK_HOME/lib/spark-examples-1.5.2-hadoop2.6.0.jar
>
> java.lang.ClassNotFoundException: org.apache.spark.examples/HdfsTest
> at java.lang.Class.forName0(Native Method)
> at java.lang.Class.forName(Class.java:270)
> at org.apache.spark.util.Utils$.classForName(Utils.scala:173)
> at
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:641)
> at
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
> at
> org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>
> thanks
>
>
>
>
> On Tuesday, 16 February 2016, 1:33, Ted Yu  wrote:
>
>
> Here is the path to the examples jar in 1.6.0 release:
>
> ./lib/spark-examples-1.6.0-hadoop2.6.0.jar
>
> On Mon, Feb 15, 2016 at 5:30 PM, Ted Yu  wrote:
>
> If you don't modify HdfsTest.scala, there is no need to rebuild it - it is
> contained in the examples jar coming with Spark release.
>
> You can use spark-submit to run the example.
>
> Cheers
>
> On Mon, Feb 15, 2016 at 5:24 PM, Ashok Kumar  > wrote:
>
> Gurus,
>
> I am trying to run some examples given under directory examples
>
> spark/examples/src/main/scala/org/apache/spark/examples/
>
> I am trying to run HdfsTest.scala
>
> However, when I run HdfsTest.scala  against spark shell it comes back with
> error
>
> Spark context available as sc.
> SQL context available as sqlContext.
> Loading HdfsTest.scala...
> :19: error: illegal start of definition
>package org.apache.spark.examples
>^
> import org.apache.spark._
> defined module HdfsTest
>
> scala>
>
> Can someone guide me how to run these Scala examples without errors?  Do I
> need to compile them first with scalac -cp $CLASSPATH HdfsTest.scala
>
> Thanking you
>
>
>
>
>
>


Re: How to run Scala file examples in spark 1.5.2

2016-02-15 Thread Ted Yu
If you don't modify HdfsTest.scala, there is no need to rebuild it - it is
contained in the examples jar coming with Spark release.

You can use spark-submit to run the example.

Cheers

On Mon, Feb 15, 2016 at 5:24 PM, Ashok Kumar 
wrote:

> Gurus,
>
> I am trying to run some examples given under directory examples
>
> spark/examples/src/main/scala/org/apache/spark/examples/
>
> I am trying to run HdfsTest.scala
>
> However, when I run HdfsTest.scala  against spark shell it comes back with
> error
>
> Spark context available as sc.
> SQL context available as sqlContext.
> Loading HdfsTest.scala...
> :19: error: illegal start of definition
>package org.apache.spark.examples
>^
> import org.apache.spark._
> defined module HdfsTest
>
> scala>
>
> Can someone guide me how to run these Scala examples without errors?  Do I
> need to compile them first with scalac -cp $CLASSPATH HdfsTest.scala
>
> Thanking you
>


Re: Text search in Spark on compressed bz2 files

2016-02-15 Thread Mich Talebzadeh
 

On 16/02/2016 00:02, Mich Talebzadeh wrote: 

> Hi 
> 
> It does not seem that sc.textFile supports search on log files compressed 
> with bzip2 
> 
> val logfile2 = sc.textFile("hdfs://rhes564:9000/test/REP_*.log.bz2") 
> 
> val df2 = logfile2.toDF("line")
> val errors2 = df2.filter(col("line").contains("E."))
> errors2.count() 
> 
> Nothing is returned. Is there a method call to read a compressed file? 
> 
> Thanks. 
> -- 
> 
> Dr Mich Talebzadeh
> 
> LinkedIn 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> 
> http://talebzadehmich.wordpress.com
> 

-- 

Dr Mich Talebzadeh

LinkedIn
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com

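For what it is worth, sc.textFile should read .bz2 files transparently (the bzip2 codec is handled by Hadoop's TextInputFormat), so a quick sanity check is to run the filter directly on the RDD, bypassing the DataFrame conversion; a sketch using the same path as above:

val logfile2 = sc.textFile("hdfs://rhes564:9000/test/REP_*.log.bz2")

// count matching lines straight off the RDD
val errorCount = logfile2.filter(line => line.contains("E.")).count()
println(s"matching lines: $errorCount")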

 

How to run Scala file examples in spark 1.5.2

2016-02-15 Thread Ashok Kumar
 Gurus,
I am trying to run some examples given under directory examples
spark/examples/src/main/scala/org/apache/spark/examples/
I am trying to run HdfsTest.scala 
However, when I run HdfsTest.scala  against spark shell it comes back with error
Spark context available as sc.
SQL context available as sqlContext.
Loading HdfsTest.scala...
:19: error: illegal start of definition
   package org.apache.spark.examples
   ^
import org.apache.spark._
defined module HdfsTest
scala>
Can someone guide me how to run these Scala examples without errors?  Do I need to 
compile them first with scalac -cp $CLASSPATH HdfsTest.scala  
Thanking you


Re: Migrating Transformers from Spark 1.3.1 to 1.5.0

2016-02-15 Thread Cesar Flores
I found my problem. I was calling setParameterValue(defaultValue) more than
one time in the hierarchy of my classes.




Thanks!

On Mon, Feb 15, 2016 at 6:34 PM, Cesar Flores  wrote:

>
> I have a set of transformers (each with specific parameters) in spark
> 1.3.1. I have two versions, one that works and one that does not:
>
> 1.- working version
> //featureprovidertransformer contains already a set of ml params
> class DemographicTransformer(override val uid: String) extends
> FeatureProviderTransformer {
>
>   def this() = this(Identifiable.randomUID("demo-transformer"))
>   override def copy(extra: ParamMap): DemographicTransformer =
> defaultCopy(extra)
>
>   
>
> }
>
> 2.- not working version
> class DemographicTransformer(override val uid: String) extends
> FeatureProviderTransformer {
>
>   def this() = this(Identifiable.randomUID("demo-transformer"))
>   override def copy(extra: ParamMap): DemographicTransformer =
> defaultCopy(extra)
>
>   *//add another transformer parameter*
>   final val anotherParam: Param[String] = new Param[String](this,
> "anotherParam", "dummy parameter")
>   
>
> }
>
> Somehow adding *anotherParam* to my class makes it fail, with the
> following error:
>
> [info]   java.lang.NullPointerException:
> [info]   at
> org.apache.spark.ml.param.Params$$anonfun$hasParam$1.apply(params.scala:408)
> [info]   at
> org.apache.spark.ml.param.Params$$anonfun$hasParam$1.apply(params.scala:408)
> [info]   at
> scala.collection.IndexedSeqOptimized$$anonfun$exists$1.apply(IndexedSeqOptimized.scala:40)
> [info]   at
> scala.collection.IndexedSeqOptimized$$anonfun$exists$1.apply(IndexedSeqOptimized.scala:40)
> [info]   at
> scala.collection.IndexedSeqOptimized$class.segmentLength(IndexedSeqOptimized.scala:189)
> [info]   at
> scala.collection.mutable.ArrayOps$ofRef.segmentLength(ArrayOps.scala:108)
> [info]   at
> scala.collection.GenSeqLike$class.prefixLength(GenSeqLike.scala:92)
> [info]   at
> scala.collection.mutable.ArrayOps$ofRef.prefixLength(ArrayOps.scala:108)
> [info]   at
> scala.collection.IndexedSeqOptimized$class.exists(IndexedSeqOptimized.scala:40)
> [info]   at
> scala.collection.mutable.ArrayOps$ofRef.exists(ArrayOps.scala:108)
>
> Debugging the params.scala class shows me that adding *anotherParam*
> actually replaces all parameters with a single one called allParams.
>
> Does anyone have any idea what I may be doing wrong? My guess is that
> I am doing something weird in my class hierarchy but cannot figure out
> what.
>
>
> Thanks!
> --
> Cesar Flores
>



-- 
Cesar Flores


Migrating Transformers from Spark 1.3.1 to 1.5.0

2016-02-15 Thread Cesar Flores
I have a set of transformers (each with specific parameters) in spark
1.3.1. I have two versions, one that works and one that does not:

1.- working version
//featureprovidertransformer contains already a set of ml params
class DemographicTransformer(override val uid: String) extends
FeatureProviderTransformer {

  def this() = this(Identifiable.randomUID("demo-transformer"))
  override def copy(extra: ParamMap): DemographicTransformer =
defaultCopy(extra)

  

}

2.- not working version
class DemographicTransformer(override val uid: String) extends
FeatureProviderTransformer {

  def this() = this(Identifiable.randomUID("demo-transformer"))
  override def copy(extra: ParamMap): DemographicTransformer =
defaultCopy(extra)

  *//add another transformer parameter*
  final val anotherParam: Param[String] = new Param[String](this,
"anotherParam", "dummy parameter")
  

}

Somehow adding *anotherParam* to my class makes it fail, with the
following error:

[info]   java.lang.NullPointerException:
[info]   at
org.apache.spark.ml.param.Params$$anonfun$hasParam$1.apply(params.scala:408)
[info]   at
org.apache.spark.ml.param.Params$$anonfun$hasParam$1.apply(params.scala:408)
[info]   at
scala.collection.IndexedSeqOptimized$$anonfun$exists$1.apply(IndexedSeqOptimized.scala:40)
[info]   at
scala.collection.IndexedSeqOptimized$$anonfun$exists$1.apply(IndexedSeqOptimized.scala:40)
[info]   at
scala.collection.IndexedSeqOptimized$class.segmentLength(IndexedSeqOptimized.scala:189)
[info]   at
scala.collection.mutable.ArrayOps$ofRef.segmentLength(ArrayOps.scala:108)
[info]   at
scala.collection.GenSeqLike$class.prefixLength(GenSeqLike.scala:92)
[info]   at
scala.collection.mutable.ArrayOps$ofRef.prefixLength(ArrayOps.scala:108)
[info]   at
scala.collection.IndexedSeqOptimized$class.exists(IndexedSeqOptimized.scala:40)
[info]   at
scala.collection.mutable.ArrayOps$ofRef.exists(ArrayOps.scala:108)

Debugging the params.scala class shows me that adding *anotherParam*
actually replaces all parameters with a single one called allParams.

Does anyone have any idea what I may be doing wrong? My guess is that I
am doing something weird in my class hierarchy but cannot figure out what.


Thanks!
-- 
Cesar Flores


Re: Dataset takes more memory compared to RDD

2016-02-15 Thread Michael Armbrust
What algorithm? Can you provide code?

On Fri, Feb 12, 2016 at 3:22 PM, Raghava Mutharaju <
m.vijayaragh...@gmail.com> wrote:

> Hello All,
>
> I implemented an algorithm using both the RDDs and the Dataset API (in
> Spark 1.6). Dataset version takes lot more memory than the RDDs. Is this
> normal? Even for very small input data, it is running out of memory and I
> get a java heap exception.
>
> I tried the Kryo serializer by registering the classes and I
> set spark.kryo.registrationRequired to true. I get the following exception
>
> com.esotericsoftware.kryo.KryoException:
> java.lang.IllegalArgumentException: Class is not registered:
> org.apache.spark.sql.types.StructField[]
> Note: To register this class use:
> kryo.register(org.apache.spark.sql.types.StructField[].class);
>
> I tried registering
> using conf.registerKryoClasses(Array(classOf[StructField[]]))
>
> But StructField[] does not exist. Is there any other way to register it? I
> already registered StructField.
>
> Regards,
> Raghava.
>
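On the Kryo point, an array class such as StructField[] is spelled classOf[Array[StructField]] in Scala; a minimal sketch of the registration (the rest of the conf is omitted):

import org.apache.spark.SparkConf
import org.apache.spark.sql.types.StructField

val conf = new SparkConf()
  .set("spark.kryo.registrationRequired", "true")
  .registerKryoClasses(Array(
    classOf[StructField],
    classOf[Array[StructField]]   // the Scala spelling of StructField[]
  ))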


RE: Check if column exists in Schema

2016-02-15 Thread Mohammed Guller
The DataFrame class has a method named columns, which returns all column names 
as an array. You can then use the contains method in the Scala Array class to 
check whether a column exists.

Mohammed
Author: Big Data Analytics with 
Spark

From: Sebastian Piu [mailto:sebastian@gmail.com]
Sent: Monday, February 15, 2016 11:21 AM
To: user
Subject: Re: Check if column exists in Schema

I just realised this is a bit vague, I'm looking to create a function that 
looks into different columns to get a value. So depending on a type I might 
look into a given path or another (which might or might not exist).

For example, if the column some.path.to.my.date exists I'd return that; if it doesn't, or 
it is null, I'd get it from some other place.

On Mon, Feb 15, 2016 at 7:17 PM Sebastian Piu 
> wrote:
Is there any way of checking if a given column exists in a Dataframe?


Out of Memory error caused by output object in mapPartitions

2016-02-15 Thread nitinkak001
My mapPartitions code, given below, outputs one record for each input record.
So the output has the same number of records as the input. I am loading the
output data into a ListBuffer object. This object is turning out to be too
huge for memory, leading to an Out Of Memory exception.

To be more clear, my logic per partition is as below:

Iterator(iter1) -> Processing -> ListBuffer(list1)

iter1.size() = list1.size()
list1 goes out of memory

I cannot change the partition size. My partition is based on the input key and
all the records corresponding to a key need to go into the same partition. Is
there a workaround to this?

tempRDD = iterateRDD.mapPartitions(p => {
  var outputList = ListBuffer[String]()
  var minVal = 0L
  while (p.hasNext) {
    val tpl = p.next()
    val key = tpl._1
    val value = tpl._2
    if (key != prevKey) {
      if (value < key) {
        minVal = value
        outputList.add(minVal.toString() + "\t" + key.toString())
      } else {
        minVal = key
        outputList.add(minVal.toString() + "\t" + value.toString())
      }
    } else {
      outputList.add(minVal.toString() + "\t" + value.toString())
    }
    prevKey = key
  }
  outputList.iterator
})
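For illustration, one pattern that avoids materializing the whole partition is to map the iterator lazily instead of buffering into a ListBuffer; a sketch that keeps the same comparison logic, assuming the keys and values are Longs as the snippet suggests:

val tempRDD = iterateRDD.mapPartitions { p =>
  var minVal = 0L
  var prevKey = Long.MinValue   // assumed sentinel for "no previous key yet"
  // Iterator.map is lazy: each output line is built only when the
  // downstream consumer (e.g. saveAsTextFile) pulls it
  p.map { case (key, value) =>
    val second =
      if (key != prevKey) {
        minVal = math.min(key, value)
        if (value < key) key else value
      } else {
        value
      }
    prevKey = key
    minVal.toString + "\t" + second.toString
  }
}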






--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Out-of-Memory-error-caused-by-output-object-in-mapPartitions-tp26229.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Working out the optimizer matrix in Spark

2016-02-15 Thread Mich Talebzadeh
 

Hi, 

I would like to know if there are commands available in Spark to see all
active processes plus the details of each process.

FYI, I am aware of the cluster information in the Spark GUI on port 4040. What
I am specifically looking for is detail from the optimiser itself: the
physical and logical IOs and the locks on tables.

For example, whether held locks can be shown, much like the Hive SHOW LOCKS
command:

Lock ID  Database      Table          Partition  State     Type         Transaction ID  Last Heartbeat  Acquired At    User    Hostname
6077     oraclehadoop  sales_staging  NULL       ACQUIRED  SHARED_READ  NULL            1455573580652   1455573578990  hduser  rhes564

Thanks, 
-- 

Dr Mich Talebzadeh

LinkedIn
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com


 

Re: Using SPARK packages in Spark Cluster

2016-02-15 Thread Eduardo Costa Alfaia
Hi Gourav,

I did a test as you said, and for me it is working. I am using Spark in local mode,
master and worker on the same machine. I ran the example in spark-shell
--packages com.databricks:spark-csv_2.10:1.3.0 without errors.

BR

From:  Gourav Sengupta 
Date:  Monday, February 15, 2016 at 10:03
To:  Jorge Machado 
Cc:  Spark Group 
Subject:  Re: Using SPARK packages in Spark Cluster

Hi Jorge/ All,

Please please please go through this link  
http://spark.apache.org/docs/latest/spark-standalone.html. 
The link tells you how to start a SPARK cluster in local mode. If you have not 
started or worked with a SPARK cluster in local mode, kindly do not attempt to 
answer this question.

My question is how to use packages like 
https://github.com/databricks/spark-csv when I am using a SPARK cluster in local 
mode.

Regards,
Gourav Sengupta


On Mon, Feb 15, 2016 at 1:55 PM, Jorge Machado  wrote:
Hi Gourav, 

I did not understand your problem… the --packages option should not make
any difference whether you are running standalone or on YARN, for example.
Give us an example of what packages you are trying to load and what error you
are getting… If you want to use the libraries from spark-packages.org without
--packages, why do you not use Maven?
Regards 


On 12/02/2016, at 13:22, Gourav Sengupta  wrote:

Hi,

I am creating sparkcontext in a SPARK standalone cluster as mentioned here: 
http://spark.apache.org/docs/latest/spark-standalone.html using the following 
code:

--
sc.stop()
conf = SparkConf().set('spark.driver.allowMultipleContexts', False) \
                  .setMaster("spark://hostname:7077") \
                  .set('spark.shuffle.service.enabled', True) \
                  .set('spark.dynamicAllocation.enabled', 'true') \
                  .set('spark.executor.memory', '20g') \
                  .set('spark.driver.memory', '4g') \
                  .set('spark.default.parallelism', (multiprocessing.cpu_count() - 1))
conf.getAll()
sc = SparkContext(conf = conf)

-(we should definitely be able to optimise the configuration but that is 
not the point here) ---

I am not able to use packages, a list of which is mentioned here 
http://spark-packages.org, using this method. 

Whereas if I use the standard "pyspark --packages" option, the packages 
load just fine.

I will be grateful if someone could kindly let me know how to load packages 
when starting a cluster as mentioned above.


Regards,
Gourav Sengupta




-- 
Privacy notice: http://www.unibs.it/node/8155


Re: caching ratings with ALS implicit

2016-02-15 Thread Sean Owen
It will need its intermediate RDDs to be cached, and it will do that
internally. See the setIntermediateRDDStorageLevel method. Skim the
API docs too.
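A sketch of what that looks like with the mllib ALS builder, assuming ratings is the RDD[Rating] in question; the rank, iteration count and storage level below are illustrative:

import org.apache.spark.mllib.recommendation.ALS
import org.apache.spark.storage.StorageLevel

val model = new ALS()
  .setImplicitPrefs(true)
  .setRank(10)
  .setIterations(10)
  // ALS persists its own intermediate RDDs; this only controls how
  .setIntermediateRDDStorageLevel(StorageLevel.MEMORY_AND_DISK)
  .run(ratings)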

On Mon, Feb 15, 2016 at 9:21 PM, Roberto Pagliari
 wrote:
> Something not clear from the documentation is whether the ratings RDD needs
> to be cached before calling ALS trainImplicit. Would there be any
> performance gain?




Re: recommendations with duplicate ratings

2016-02-15 Thread Sean Owen
You're asking what happens when you put many ratings for one user-item
pair in the input, right? I'm saying you shouldn't do that --
aggregate them into one pair in your application.

For rating-like (explicit) data, it doesn't really make sense
otherwise. The only sensible aggregation is last-first, but there's no
natural notion of 'last' in the RDD you supply.

For count-like (implicit) data, it makes sense to sum the inputs, but
I don't think that is done automatically. I skimmed the code and
didn't see it. So you would sum the values per user-item anyway.

On Mon, Feb 15, 2016 at 9:05 PM, Roberto Pagliari
 wrote:
> Hi Sean,
> I'm not sure what you mean by aggregate. The input of trainImplicit is an
> RDD of Ratings.
>
> I find it odd that duplicate ratings would mess with ALS in the implicit
> case. It'd be nice if it didn't.
>
>
> Thank you,
>
> On 15/02/2016 20:49, "Sean Owen"  wrote:
>
>>I believe you need to aggregate inputs per user-item in your call. I
>>am actually not sure what happens if you don't. I think it would
>>compute the factors twice and one would win, so yes I think it would
>>effectively be ignored.  For implicit, that wouldn't work correctly,
>>so you do need to aggregate.
>>
>>On Mon, Feb 15, 2016 at 8:30 PM, Roberto Pagliari
>> wrote:
>>> What happens when duplicate user/ratings are fed into ALS (the implicit
>>> version, specifically)? Are duplicates ignored?
>>>
>>> I'm asking because that would save me a distinct.
>>>
>>>
>>>
>>> Thank you,
>>>
>




caching ratings with ALS implicit

2016-02-15 Thread Roberto Pagliari
Something not clear from the documentation is whether the ratings RDD needs to 
be cached before calling ALS trainImplicit. Would there be any performance gain?


Re: recommendations with duplicate ratings

2016-02-15 Thread Roberto Pagliari
Hi Sean,
I'm not sure what you mean by aggregate. The input of trainImplicit is an
RDD of Ratings. 

I find it odd that duplicate ratings would mess with ALS in the implicit
case. It'd be nice if it didn't.


Thank you, 

On 15/02/2016 20:49, "Sean Owen"  wrote:

>I believe you need to aggregate inputs per user-item in your call. I
>am actually not sure what happens if you don't. I think it would
>compute the factors twice and one would win, so yes I think it would
>effectively be ignored.  For implicit, that wouldn't work correctly,
>so you do need to aggregate.
>
>On Mon, Feb 15, 2016 at 8:30 PM, Roberto Pagliari
> wrote:
>> What happens when duplicate user/ratings are fed into ALS (the implicit
>> version, specifically)? Are duplicates ignored?
>>
>> I'm asking because that would save me a distinct.
>>
>>
>>
>> Thank you,
>>


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: recommendations with duplicate ratings

2016-02-15 Thread Sean Owen
I believe you need to aggregate inputs per user-item in your call. I
am actually not sure what happens if you don't. I think it would
compute the factors twice and one would win, so yes I think it would
effectively be ignored.  For implicit, that wouldn't work correctly,
so you do need to aggregate.

On Mon, Feb 15, 2016 at 8:30 PM, Roberto Pagliari
 wrote:
> What happens when duplicate user/ratings are fed into ALS (the implicit
> version, specifically)? Are duplicates ignored?
>
> I’m asking because that would save me a distinct.
>
>
>
> Thank you,
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



recommendations with duplicate ratings

2016-02-15 Thread Roberto Pagliari
What happens when duplicate user/ratings are fed into ALS (the implicit 
version, specifically)? Are duplicates ignored?

I'm asking because that would save me a distinct.



Thank you,



Re: temporary tables created by registerTempTable()

2016-02-15 Thread Mich Talebzadeh
 

Hi Michael, 

A temporary table in Hive is private to the session that created it,
for the lifetime of that session. The table is created
in the same database (in this case oraclehadoop.db) first and then
moved to the /tmp directory in HDFS, under a _tmp_space.db directory,
as shown below 

INFO : Moving data to:
hdfs://rhes564:9000/tmp/hive/hduser/d762c796-5953-478e-98bd-d51f13ddadbf/_tmp_space.db/7972610a-3088-4535-ad25-af2c9cfce6b2
from
hdfs://rhes564:9000/user/hive/warehouse/oraclehadoop.db/.hive-staging_hive_2016-02-15_19-34-08_358_5135480012597277692-4/-ext-10002


Example creating a temporary Hive table called tmp 

0: jdbc:hive2://rhes564:10010/default> CREATE TEMPORARY TABLE tmp AS
0: jdbc:hive2://rhes564:10010/default> SELECT t.calendar_month_desc,
c.channel_desc, SUM(s.amount_sold) AS TotalSales
0: jdbc:hive2://rhes564:10010/default> FROM sales s, times t, channels c
0: jdbc:hive2://rhes564:10010/default> WHERE s.time_id = t.time_id
0: jdbc:hive2://rhes564:10010/default> AND s.channel_id = c.channel_id
0: jdbc:hive2://rhes564:10010/default> GROUP BY t.calendar_month_desc,
c.channel_desc
0: jdbc:hive2://rhes564:10010/default> ;
INFO : Ended Job = job_1455564800116_0001
INFO : Moving data to:
hdfs://rhes564:9000/tmp/hive/hduser/d762c796-5953-478e-98bd-d51f13ddadbf/_tmp_space.db/7972610a-3088-4535-ad25-af2c9cfce6b2
from
hdfs://rhes564:9000/user/hive/warehouse/oraclehadoop.db/.hive-staging_hive_2016-02-15_19-34-08_358_5135480012597277692-4/-ext-10002
INFO : Table oraclehadoop.tmp stats: [numFiles=1, numRows=150,
totalSize=3934, rawDataSize=3784] 

Other sessions will not see that table and can create a temp table with
the same name as well as I did with the first session still open 

INFO : Moving data to:
hdfs://rhes564:9000/tmp/hive/hduser/65f3317b-1341-40b8-86c3-f0b8a9c02ad6/_tmp_space.db/cd0c783d-8966-45bb-8f7a-5059159cbf8d
from
hdfs://rhes564:9000/user/hive/warehouse/oraclehadoop.db/.hive-staging_hive_2016-02-15_20-04-57_780_9035563040280767266-11/-ext-10002

INFO : Table oraclehadoop.tmp stats: [numFiles=1, numRows=150,
totalSize=3934, rawDataSize=3784] 

Note that each temp table file path is time stamped (see the staging paths above), so there
will not be any collision, and the temp table will disappear after the
session is complete. 

HTH 

Mich 

On 15/02/2016 18:41, Michael Segel wrote: 

> I was just looking at that... 
> 
> Out of curiosity... if you make it a Hive Temp Table... who has access to the 
> data? 
> 
> Just your app, or anyone with access to the same database? (Would you be able 
> to share data across different JVMs? ) 
> 
> (E.G - I have a reader who reads from source A that needs to publish the data 
> to a bunch of minions (B) ) 
> 
> Would this be an option? 
> 
> Thx 
> 
> -Mike 
> 
> On Feb 15, 2016, at 7:54 AM, Mich Talebzadeh 
>  wrote: 
> 
> Hi, 
> 
> It is my understanding that the registered temporary tables created by 
> registerTempTable() in the Spark shell are built on ORC files? 
> 
> For example the following Data Frame just creates a logical abstraction 
> 
> scala> var s = HiveContext.sql("SELECT AMOUNT_SOLD, TIME_ID, CHANNEL_ID FROM 
> oraclehadoop.sales")
> s: org.apache.spark.sql.DataFrame = [AMOUNT_SOLD: decimal(10,0), TIME_ID: 
> timestamp, CHANNEL_ID: bigint] 
> 
> Then I register this data frame as a temporary table using the registerTempTable() 
> call 
> 
> s.registerTempTable("t_s") 
> 
> Also I believe that s.registerTempTable("t_s") creates an in-memory table 
> that is scoped to the cluster in which it was created. The data is stored 
> using Hive's ORC format and this tempTable is stored in memory on all nodes 
> of the cluster? In other words every node in the cluster has a copy of 
> tempTable in its memory? 
> 
> Thanks, 
> 
> -- 
> 
> Dr Mich Talebzadeh
> 
> LinkedIn 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  [1]
> 
> http://talebzadehmich.wordpress.com [2]
> 
> NOTE: The information in this email is proprietary and confidential. This 
> message is for the designated recipient only, if you are not the intended 
> recipient, you should destroy it immediately. Any information in this message 
> shall not be understood as given or endorsed by Cloud Technology Partners 
> Ltd, its subsidiaries or their employees, unless expressly so stated. It is 
> the responsibility of the recipient to ensure that this email is virus free, 
> therefore neither Cloud Technology partners Ltd, its subsidiaries nor their 
> employees accept any responsibility.
> 
> -- 
> 
> Dr Mich Talebzadeh
> 
> LinkedIn 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  [1]
> 
> http://talebzadehmich.wordpress.com [2]
> 
> NOTE: The information in this email is proprietary and confidential. This 
> message is for the designated recipient only, if you are not the intended 
> recipient, you should destroy it immediately. Any 

How to partition a dataframe based on an Id?

2016-02-15 Thread SRK
Hi,

How to partition a dataframe of User Objects based on an Id so that I can
both do a join on an Id  and also retrieve all the user objects in between a
time period when queried?

Thanks!



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-partition-a-dataframe-based-on-an-Id-tp26228.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Is predicate push-down supported by default in dataframes?

2016-02-15 Thread SRK
Hi,

Is predicate push down supported by default in dataframes or is it dependent
on the format in which the dataframes is stored like Parquet?  

Thanks,
Swetha



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Is-predicate-push-down-supported-by-default-in-dataframes-tp26227.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



[ANNOUNCE] Apache SystemML 0.9.0-incubating released

2016-02-15 Thread Luciano Resende
The Apache SystemML team is pleased to announce the release of Apache
SystemML version 0.9.0-incubating. This is the first release as an Apache
project.

Apache SystemML provides declarative large-scale machine learning (ML) that
aims at flexible specification of ML algorithms and automatic generation of
hybrid runtime plans ranging from single-node, in-memory computations, to
distributed computations on Apache Hadoop MapReduce and Apache Spark.

Extensive updates have been made to the release in several areas. These
include APIs, data ingestion, optimizations, language and runtime
operators, new algorithms, testing, and online documentation.

*APIs*

Improvements to MLContext and to MLPipeline wrappers

*Data Ingestion*

Data conversion utilities (from RDDs and DataFrames)
Data transformations on raw data sets

*Optimizations*

Extensions to compilation chain, including IPA
Improvements to parfor
Improved execution of concurrent Spark jobs
New rewrites, including eager RDD caching and repartitioning
Improvements to buffer pool caching
Partitioning-preserving operations
On-demand creation of SparkContext
Efficient use of RDD checkpointing

*Language and Runtime Operators*

New matrix multiplication operators (e.g., ZipMM)
New multi-threaded readers and operators
Extended aggregation-outer operations for different relational operators
Sample capability

*New Algorithms*

Alternating Least Squares (Conjugate Gradient)
Cubic Splines (Conjugate Gradient and Direct Solve)

*Testing*

PyDML algorithm tests
Test suite refactoring
Improvements to performance tests

*Online Documentation*

GitHub README
Quick Start Guide
DML and PyDML Programming Guide
MLContext Programming Guide
Algorithms Reference
DML Language Reference
Debugger Guide


To download the distribution, please go to :

http://systemml.apache.org/

The Apache SystemML Team

---
Apache SystemML is an effort undergoing Incubation
 at The Apache Software Foundation
(ASF), sponsored by the Incubator. Incubation is required of all newly
accepted projects until a further review indicates that the infrastructure,
communications, and decision making process have stabilized in a manner
consistent with other successful ASF projects. While incubation status is
not necessarily a reflection of the completeness or stability of the code,
it does indicate that the project has yet to be fully endorsed by the ASF.


Re: How to join an RDD with a hive table?

2016-02-15 Thread swetha kasireddy
How about saving the dataframe as a table partitioned by userId? My User
records have userId, number of sessions, visit count etc. as the columns, and
the table should be partitioned by userId. I will need to join the user table, saved
in the database as follows, with an incoming session RDD. The session RDD
would have a sessionId and a sessionRecord which has the userId. So I need to
save the user data as a table using dataframes, partitioned by userId,
and then join it with the session RDDs. How can I join a
dataframe saved in HDFS with an incoming RDD so that not all the records are
read, but only the records for which the join conditions are met?

df.write.partitionBy('userId').saveAsTable(...)
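
A minimal spark-shell sketch of the RDD-to-DataFrame join being discussed (the Session case class, the sample rows and the table name "users" are made up; note that saving the table partitioned by userId does not by itself guarantee that only the matching partitions are scanned, as that depends on the join strategy and on which filters can be pushed down):

import sqlContext.implicits._

case class Session(sessionId: String, userId: String)

// incoming session RDD turned into a DataFrame
val sessionDF = sc.parallelize(Seq(Session("s1", "u1"), Session("s2", "u2"))).toDF()

// assumes the user data was previously written with .write.partitionBy("userId").saveAsTable("users")
val users = sqlContext.table("users")

val joined = users.join(sessionDF, users("userId") === sessionDF("userId"))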


On Mon, Feb 15, 2016 at 10:09 AM, Mich Talebzadeh <
mich.talebza...@cloudtechnologypartners.co.uk> wrote:

>
>
> It depends on how many columns you need from tables for your queries and
> potential number of rows.
>
> From my experience I don't believe that registering a table as temporary
> means it is going to cache whole 1 billion rows into memory. That does not
> make sense (I stand corrected). Only a fraction of rows and columns will be
> needed.
>
> It will be interesting to know how Catalyst is handling this. I suspect it
> behaves much like any data cache in a relational database by having some
> form of MRU-LRU chain where rows are read into memory from the blocks,
> processed and discarded to make room for new ones. If the memory is not big
> enough the operation is spilled to disk.
>
> I just did a test on three tables in Hive with Spark 1.5.2 using Data
> Frames and tempTables
>
> The FACT table had 1 billion rows as follows:
>
>
> CREATE TABLE `sales_staging`(
>   `prod_id` bigint,
>   `cust_id` bigint,
>   `time_id` timestamp,
>   `channel_id` bigint,
>   `promo_id` bigint,
>   `quantity_sold` decimal(10,0),
>   `amount_sold` decimal(10,0))
> ROW FORMAT SERDE
>   'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
> STORED AS INPUTFORMAT
>   'org.apache.hadoop.mapred.TextInputFormat'
> OUTPUTFORMAT
>   'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
> LOCATION
>   'hdfs://rhes564:9000/user/hive/warehouse/oraclehadoop.db/sales_staging'
> TBLPROPERTIES (
>   'COLUMN_STATS_ACCURATE'='true',
>   'last_modified_by'='hduser',
>   'last_modified_time'='1451305601',
>   'numFiles'='4',
>   'numRows'='10',
>   'rawDataSize'='46661545000',
>   'totalSize'='47661545000',
>   'transient_lastDdlTime'='1451767601')
>
>
>
> The other dimension tables were tiny. It took 13 minutes to get the first
> 10 rows back, requiring only a few columns of interest. So I don't think
> it was loading 1 billion rows into memory from the sales_staging table.
>
> Started at
>
> [15/02/2016 17:47:28.28]
>
> s: org.apache.spark.sql.DataFrame = [AMOUNT_SOLD: decimal(10,0), TIME_ID:
> timestamp, CHANNEL_ID: bigint]
>
> c: org.apache.spark.sql.DataFrame = [CHANNEL_ID: double, CHANNEL_DESC:
> string]
>
> t: org.apache.spark.sql.DataFrame = [TIME_ID: timestamp,
> CALENDAR_MONTH_DESC: string]
>
> sqltext: String = ""
>
> sqltext: String =
>
> SELECT rs.Month, rs.SalesChannel, round(TotalSales,2)
>
> FROM
>
> (
>
> SELECT t_t.CALENDAR_MONTH_DESC AS Month, t_c.CHANNEL_DESC AS SalesChannel,
> SUM(t_s.AMOUNT_SOLD) AS TotalSales
>
> FROM t_s, t_t, t_c
>
> WHERE t_s.TIME_ID = t_t.TIME_ID
>
> AND   t_s.CHANNEL_ID = t_c.CHANNEL_ID
>
> GROUP BY t_t.CALENDAR_MONTH_DESC, t_c.CHANNEL_DESC
>
> ORDER by t_t.CALENDAR_MONTH_DESC, t_c.CHANNEL_DESC
>
> ) rs
>
> LIMIT 10
>
>
>
> [1998-01,Direct Sales,1823005210]
>
> [1998-01,Internet,248172522]
>
> [1998-01,Partners,474646900]
>
> [1998-02,Direct Sales,1819659036]
>
> [1998-02,Internet,298586496]
>
> 

Re: Check if column exists in Schema

2016-02-15 Thread Sebastian Piu
I just realised this is a bit vague: I'm looking to create a function that
looks into different columns to get a value. So depending on a type, I might
look into one path or another (which might or might not exist).

For example, if the column *some.path.to.my.date* exists I'd return that; if it
doesn't, or it is null, I'd get it from some other place.
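
A small Scala sketch of one way to express that (hasColumn, withEventDate, otherDate and eventDate are made-up names; the check relies on column resolution failing when the nested path does not exist):

import scala.util.Try
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{coalesce, col}

// true if the (possibly nested) column path can be resolved against the schema
def hasColumn(df: DataFrame, path: String): Boolean = Try(df(path)).isSuccess

// prefer the nested date when present and non-null, otherwise fall back to another column
def withEventDate(df: DataFrame): DataFrame =
  if (hasColumn(df, "some.path.to.my.date"))
    df.withColumn("eventDate", coalesce(col("some.path.to.my.date"), col("otherDate")))
  else
    df.withColumn("eventDate", col("otherDate"))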

On Mon, Feb 15, 2016 at 7:17 PM Sebastian Piu 
wrote:

> Is there any way of checking if a given column exists in a Dataframe?
>


Check if column exists in Schema

2016-02-15 Thread Sebastian Piu
Is there any way of checking if a given column exists in a Dataframe?


Re: temporary tables created by registerTempTable()

2016-02-15 Thread Michael Segel
I was just looking at that… 

Out of curiosity… if you make it a Hive Temp Table… who has access to the data? 

Just your app, or anyone with access to the same database?  (Would you be able 
to share data across different JVMs? ) 

(E.G - I have a reader who reads from source A that needs to publish the data 
to a bunch of minions (B)   ) 

Would this be an option? 

Thx

-Mike

> On Feb 15, 2016, at 7:54 AM, Mich Talebzadeh 
>  wrote:
> 
>> Hi,
>> 
>>  
>> It is my understanding that the registered temporary tables created by 
>> registerTempTable() in the Spark shell are built on ORC files?
>> 
>> For example the following Data Frame just creates a logical abstraction
>> 
>> scala> var s = HiveContext.sql("SELECT AMOUNT_SOLD, TIME_ID, CHANNEL_ID FROM 
>> oraclehadoop.sales")
>> s: org.apache.spark.sql.DataFrame = [AMOUNT_SOLD: decimal(10,0), TIME_ID: 
>> timestamp, CHANNEL_ID: bigint] 
>> 
>> Then I register this data frame as a temporary table using the registerTempTable() 
>> call
>> 
>> s.registerTempTable("t_s")
>> 
>> Also I believe that s.registerTempTable("t_s") creates an in-memory table 
>> that is scoped to the cluster in which it was created. The data is stored 
>> using Hive's ORC format and this tempTable is stored in memory on all nodes 
>> of the cluster?  In other words every node in the cluster has a copy of 
>> tempTable in its memory?
>> 
>> Thanks,
>> 
>>  
>> -- 
>> Dr Mich Talebzadeh
>> 
>> LinkedIn  
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> 
>> http://talebzadehmich.wordpress.com
>> 
>> NOTE: The information in this email is proprietary and confidential. This 
>> message is for the designated recipient only, if you are not the intended 
>> recipient, you should destroy it immediately. Any information in this 
>> message shall not be understood as given or endorsed by Cloud Technology 
>> Partners Ltd, its subsidiaries or their employees, unless expressly so 
>> stated. It is the responsibility of the recipient to ensure that this email 
>> is virus free, therefore neither Cloud Technology partners Ltd, its 
>> subsidiaries nor their employees accept any responsibility.
>> 
>  
>  
> -- 
> Dr Mich Talebzadeh
> 
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> 
> http://talebzadehmich.wordpress.com
> 
> NOTE: The information in this email is proprietary and confidential. This 
> message is for the designated recipient only, if you are not the intended 
> recipient, you should destroy it immediately. Any information in this message 
> shall not be understood as given or endorsed by Cloud Technology Partners 
> Ltd, its subsidiaries or their employees, unless expressly so stated. It is 
> the responsibility of the recipient to ensure that this email is virus free, 
> therefore neither Cloud Technology partners Ltd, its subsidiaries nor their 
> employees accept any responsibility.
> 



Re: Stateful Operation on JavaPairDStream Help Needed !!

2016-02-15 Thread Abhishek Anand
I am now trying to use mapWithState in the following way, based on some example
code. But, looking at the DAG, it does not seem to checkpoint the state,
and when restarting the application from the checkpoint it re-partitions all
the previous batches' data from Kafka.

static Function3<String, Optional<MyClass>, State<MyClass>, Tuple2<String, MyClass>> mappingFunc =
new Function3<String, Optional<MyClass>, State<MyClass>, Tuple2<String, MyClass>>() {
@Override
public Tuple2<String, MyClass> call(String key, Optional<MyClass> one,
State<MyClass> state) {
MyClass nullObj = new MyClass();
nullObj.setImprLog(null);
nullObj.setNotifyLog(null);
MyClass current = one.or(nullObj);

if(current!= null && current.getImprLog() != null &&
current.getMyClassType() == 1){
return new Tuple2<>(key, null);
}
else if (current.getNotifyLog() != null  && current.getMyClassType() == 3){
MyClass oldState = (state.exists() ? state.get() : nullObj);
if(oldState!= null && oldState.getNotifyLog() != null){
oldState.getImprLog().setCrtd(current.getNotifyLog().getCrtd());
return new Tuple2<>(key, oldState);
}
else{
return new Tuple2<>(key, null);
}
}
else{
return new Tuple2<>(key, null);
}

}
};


Please suggest if this is the proper way or am I doing something wrong.


Thanks !!
Abhi
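
For reference, a minimal Scala sketch (separate from the Java snippet above) of how mapWithState checkpointing is usually wired in Spark 1.6; the checkpoint directory, host/port, batch interval and timeout are all illustrative, and on restart the context must be rebuilt through StreamingContext.getOrCreate for the saved state to be reused:

import org.apache.spark.streaming.{Minutes, Seconds, State, StateSpec, StreamingContext}

val checkpointDir = "hdfs:///tmp/checkpoints/myApp"   // assumed path

def createContext(): StreamingContext = {
  val ssc = new StreamingContext(sc, Seconds(60))
  ssc.checkpoint(checkpointDir)                       // required for stateful streams

  val pairs = ssc.socketTextStream("localhost", 9999).map(line => (line, 1))

  // running count per key, kept in State and emitted downstream
  val mapping = (key: String, value: Option[Int], state: State[Int]) => {
    val sum = value.getOrElse(0) + state.getOption.getOrElse(0)
    state.update(sum)
    (key, sum)
  }

  pairs.mapWithState(StateSpec.function(mapping).timeout(Minutes(120))).print()
  ssc
}

val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()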

On Sun, Feb 14, 2016 at 2:26 AM, Sebastian Piu 
wrote:

> If you don't want to update your only option will be updateStateByKey then
> On 13 Feb 2016 8:48 p.m., "Ted Yu"  wrote:
>
>> mapWithState supports checkpoint.
>>
>> There has been some bug fix since release of 1.6.0
>> e.g.
>>   SPARK-12591 NullPointerException using checkpointed mapWithState with
>> KryoSerializer
>>
>> which is in the upcoming 1.6.1
>>
>> Cheers
>>
>> On Sat, Feb 13, 2016 at 12:05 PM, Abhishek Anand > > wrote:
>>
>>> Does mapWithState checkpoints the data ?
>>>
>>> When my application goes down and is restarted from checkpoint, will
>>> mapWithState need to recompute the previous batches data ?
>>>
>>> Also, to use mapWithState I will need to upgrade my application as I am
>>> using version 1.4.0 and mapWithState isnt supported there. Is there any
>>> other work around ?
>>>
>>> Cheers!!
>>> Abhi
>>>
>>> On Fri, Feb 12, 2016 at 3:33 AM, Sebastian Piu 
>>> wrote:
>>>
 Looks like mapWithState could help you?
 On 11 Feb 2016 8:40 p.m., "Abhishek Anand" 
 wrote:

> Hi All,
>
> I have an use case like follows in my production environment where I
> am listening from kafka with slideInterval of 1 min and windowLength of 2
> hours.
>
> I have a JavaPairDStream where for each key I am getting the same key
> but with different value,which might appear in the same batch or some next
> batch.
>
> When the key appears second time I need to update a field in value of
> previous key with a field in the later key. The keys for which the
> combination keys do not come should be rejected after 2 hours.
>
> At the end of each second I need to output the result to external
> database.
>
> For example :
>
> Suppose valueX is object of MyClass with fields int a, String b
> At t=1sec I am getting
> key0,value0(0,"prev0")
> key1,value1 (1, "prev1")
> key2,value2 (2,"prev2")
> key2,value3 (3, "next2")
>
> Output to database after 1 sec
> key2, newValue (2,"next2")
>
> At t=2 sec getting
> key3,value4(4,"prev3")
> key1,value5(5,"next1")
>
> Output to database after 2 sec
> key1,newValue(1,"next1")
>
> At t=3 sec
> key4,value6(6,"prev4")
> key3,value7(7,"next3")
> key5,value5(8,"prev5")
> key5,value5(9,"next5")
> key0,value0(10,"next0")
>
> Output to database after 3 sec
> key0,newValue(0,"next0")
> key3,newValue(4,"next3")
> key5,newValue(8,"next5")
>
>
> Please suggest how this can be achieved.
>
>
> Thanks a lot 
> Abhi
>
>
>
>>>
>>


Re: How to join an RDD with a hive table?

2016-02-15 Thread swetha kasireddy
OK. Would it only query for the records that I want in Hive, as per the filter,
or just load the entire table? My user table will have millions of records,
and I do not want to cause OOM errors by loading the entire table in memory.

On Mon, Feb 15, 2016 at 12:51 AM, Mich Talebzadeh 
wrote:

> Also worthwhile using temporary tables for the joint query.
>
>
>
> I can join a Hive table with any other JDBC accessed table from any other
> databases with DF and temporary tables
>
>
>
> //
>
> //Get the FACT table from Hive
>
> //
>
> var s = HiveContext.sql("SELECT AMOUNT_SOLD, TIME_ID, CHANNEL_ID FROM
> oraclehadoop.sales")
>
>
>
> //
>
> //Get the Dimension table from Oracle via JDBC
>
> //
>
> val c = HiveContext.load("jdbc",
>
> Map("url" -> "jdbc:oracle:thin:@rhes564:1521:mydb",
>
> "dbtable" -> "(SELECT to_char(CHANNEL_ID) AS CHANNEL_ID, CHANNEL_DESC FROM
> sh.channels)",
>
> "user" -> "sh",
>
> "password" -> "xxx"))
>
>
>
>
>
> s.registerTempTable("t_s")
>
> c.registerTempTable("t_c")
>
>
>
> And do the join
>
>
>
> SELECT rs.Month, rs.SalesChannel, round(TotalSales,2)
>
> FROM
>
> (
>
> SELECT t_t.CALENDAR_MONTH_DESC AS Month, t_c.CHANNEL_DESC AS SalesChannel,
> SUM(t_s.AMOUNT_SOLD) AS TotalSales
>
> FROM t_s, t_t, t_c
>
> WHERE t_s.TIME_ID = t_t.TIME_ID
>
> AND   t_s.CHANNEL_ID = t_c.CHANNEL_ID
>
> GROUP BY t_t.CALENDAR_MONTH_DESC, t_c.CHANNEL_DESC
>
> ORDER by t_t.CALENDAR_MONTH_DESC, t_c.CHANNEL_DESC
>
> ) rs
>
> LIMIT 1000
>
> """
>
> HiveContext.sql(sqltext).collect.foreach(println)
>
>
>
> HTH
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> NOTE: The information in this email is proprietary and confidential. This
> message is for the designated recipient only, if you are not the intended
> recipient, you should destroy it immediately. Any information in this
> message shall not be understood as given or endorsed by Peridale Technology
> Ltd, its subsidiaries or their employees, unless expressly so stated. It is
> the responsibility of the recipient to ensure that this email is virus
> free, therefore neither Peridale Technology Ltd, its subsidiaries nor their
> employees accept any responsibility.
>
>
>
>
>
> *From:* Ted Yu [mailto:yuzhih...@gmail.com]
> *Sent:* 15 February 2016 08:44
> *To:* SRK 
> *Cc:* user 
> *Subject:* Re: How to join an RDD with a hive table?
>
>
>
> Have you tried creating a DataFrame from the RDD and join with DataFrame
> which corresponds to the hive table ?
>
>
>
> On Sun, Feb 14, 2016 at 9:53 PM, SRK  wrote:
>
> Hi,
>
> How to join an RDD with a hive table and retrieve only the records that I
> am
> interested. Suppose, I have an RDD that has 1000 records and there is a
> Hive
> table with 100,000 records, I should be able to join the RDD with the hive
> table  by an Id and I should be able to load only those 1000 records from
> Hive table so that are no memory issues. Also, I was planning on storing
> the
> data in hive in the form of parquet files. Any help on this is greatly
> appreciated.
>
> Thanks!
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-join-an-RDD-with-a-hive-table-tp26225.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>
>


Subscribe

2016-02-15 Thread Jayesh Thakrar
Subscribe

SparkListener onApplicationEnd processing an RDD throws exception because of stopped SparkContext

2016-02-15 Thread Sumona Routh
Hi there,
I am trying to implement a listener that performs as a post-processor which
stores data about what was processed or erred. With this, I use an RDD that
may or may not change during the course of the application.

My thought was to use onApplicationEnd and then saveToCassandra call to
persist this.

From what I've gathered in my experiments, onApplicationEnd doesn't get
called until sparkContext.stop() is called. If I don't call stop in my
code, the listener won't be called. This works fine on my local tests -
stop gets called, the listener is called and then persisted to the db, and
everything works fine. However when I run this on our server,  the code in
onApplicationEnd throws the following exception:

Task serialization failed: java.lang.IllegalStateException: Cannot call
methods on a stopped SparkContext

What's the best way to resolve this? I can think of creating a new
SparkContext in the listener (I think I have to turn on allowing multiple
contexts, in case I try to create one before the other one is stopped). It
seems odd but might be doable. Additionally, what if I were to simply add
the code into my job in some sort of procedural block: doJob,
doPostProcessing, does that guarantee postProcessing will occur after the
other?

We are currently using Spark 1.2 standalone at the moment.

Please let me know if you require more details. Thanks for the assistance!
Sumona


Memory problems and missing heartbeats

2016-02-15 Thread JOAQUIN GUANTER GONZALBEZ
Hello,

In my project I am facing two different issues with Spark that are driving me 
crazy. I am currently running on EMR (Spark 1.5.2 + YARN), using the 
"--executor-memory 40G" option.

Problem #1
=

Some of my processes get killed by YARN because the container is exceeding the 
physical memory YARN assigned it. I have been able to work around this issue by 
increasing the spark.yarn.executor.memoryOverhead parameter to 8G, but that 
doesn't seem like a good solution.

My understanding is that the JVM that will run my Spark process will get 40 GB 
of heap memory (-Xmx40G), and if there is memory pressure in the process then 
the GC should kick in to ensure that the heap never exceeds those 40 GB. My 
PermGen is set to 510MB, but that is a very long way from the 8GB I need to set 
as overhead. This seems to happen when I .cache() very big RDDs and I then 
perform operations that require shuffling (cogroup & co.).

- Who is using all that off heap memory?
- Are there any tools in the Spark ecosystem that might help me debug this?


Problem #2
=

Some tasks fail because the heartbeat didn't get back to the master in 120 
seconds. Again, I can more or less work around this by increasing the timeout 
to 5 minutes, but I don't feel this is addressing the real problem.

- Does the heartbeat have its own thread or would a long-running .map() block 
the heartbeat?
- What conditions would prevent the heartbeat from being sent?

Many thanks in advance for any help with this,
Ximo.




The information contained in this transmission is privileged and confidential 
information intended only for the use of the individual or entity named above. 
If the reader of this message is not the intended recipient, you are hereby 
notified that any dissemination, distribution or copying of this communication 
is strictly prohibited. If you have received this transmission in error, do not 
read it. Please immediately reply to the sender that you have received this 
communication in error and then delete it.


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



More than one StateSpec in the same application

2016-02-15 Thread Udo Fholl
Hi all,

Does StateSpec have their own state or the state is per stream, thus all
StateSpec over the same stream will share the state?

Thanks.

Best regards,
Udo.


Re: temporary tables created by registerTempTable()

2016-02-15 Thread Mich Talebzadeh
 

> Hi, 
> 
> It is my understanding that the registered temporary tables created by 
> registerTempTable() used in Spark shell built on ORC files? 
> 
> For example the following Data Frame just creates a logical abstraction 
> 
> scala> var s = HiveContext.sql("SELECT AMOUNT_SOLD, TIME_ID, CHANNEL_ID FROM 
> oraclehadoop.sales")
> s: org.apache.spark.sql.DataFrame = [AMOUNT_SOLD: decimal(10,0), TIME_ID: 
> timestamp, CHANNEL_ID: bigint] 
> 
> Then I registar this data frame as temporary table using registerTempTable() 
> call 
> 
> s.registerTempTable("t_s") 
> 
> Also I believe that s.registerTempTable("t_s") creates an in-memory table 
> that is scoped to the cluster in which it was created. The data is stored 
> using Hive's ORC format and this tempTable is stored in memory on all nodes 
> of the cluster? In other words every node in the cluster has a copy of 
> tempTable in its memory? 
> 
> Thanks, 
> 
> -- 
> 
> Dr Mich Talebzadeh
> 
> LinkedIn 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> 
> http://talebzadehmich.wordpress.com
> 
> NOTE: The information in this email is proprietary and confidential. This 
> message is for the designated recipient only, if you are not the intended 
> recipient, you should destroy it immediately. Any information in this message 
> shall not be understood as given or endorsed by Cloud Technology Partners 
> Ltd, its subsidiaries or their employees, unless expressly so stated. It is 
> the responsibility of the recipient to ensure that this email is virus free, 
> therefore neither Cloud Technology partners Ltd, its subsidiaries nor their 
> employees accept any responsibility.

-- 

Dr Mich Talebzadeh

LinkedIn
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com

NOTE: The information in this email is proprietary and confidential.
This message is for the designated recipient only, if you are not the
intended recipient, you should destroy it immediately. Any information
in this message shall not be understood as given or endorsed by Cloud
Technology Partners Ltd, its subsidiaries or their employees, unless
expressly so stated. It is the responsibility of the recipient to ensure
that this email is virus free, therefore neither Cloud Technology
partners Ltd, its subsidiaries nor their employees accept any
responsibility.

 

Re: Passing multiple jar files to spark-shell

2016-02-15 Thread Mich Talebzadeh
 

Thanks Ted. I will have a look. 

regards 

On 15/02/2016 14:34, Ted Yu wrote: 

> Mich: 
> You can pass jars for driver using:
> 
> spark.driver.extraClassPath 
> 
> Cheers 
> 
> On Mon, Feb 15, 2016 at 1:05 AM, Mich Talebzadeh  wrote:
> 
> Thanks Deng. Unfortunately it seems that it looks for driver-class-path as 
> well.
> 
> For example with -jars alone I get 
> 
> spark-shell --master spark://50.140.197.217:7077 [1] --jars 
> /home/hduser/jars/ojdbc6.jar,/home/hduser/jars/jconn4.jar 
> 
> s: org.apache.spark.sql.DataFrame = [AMOUNT_SOLD: decimal(10,0), TIME_ID: 
> timestamp, CHANNEL_ID: bigint] 
> 
> warning: there were 1 deprecation warning(s); re-run with -deprecation for 
> details 
> 
> java.sql.SQLException: No suitable driver found for 
> jdbc:oracle:thin:@rhes564:1521:mydb 
> 
> Dr Mich Talebzadeh 
> 
> LinkedIn _ 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  [2]_ 
> 
> http://talebzadehmich.wordpress.com [3] 
> 
> NOTE: The information in this email is proprietary and confidential. This 
> message is for the designated recipient only, if you are not the intended 
> recipient, you should destroy it immediately. Any information in this message 
> shall not be understood as given or endorsed by Peridale Technology Ltd, its 
> subsidiaries or their employees, unless expressly so stated. It is the 
> responsibility of the recipient to ensure that this email is virus free, 
> therefore neither Peridale Technology Ltd, its subsidiaries nor their 
> employees accept any responsibility. 
> 
> From: odeach...@gmail.com [mailto:odeach...@gmail.com] On Behalf Of Deng 
> Ching-Mallete
> Sent: 15 February 2016 03:27
> To: Mich Talebzadeh 
> Cc: user 
> Subject: Re: Passing multiple jar files to spark-shell 
> 
> Hi Mich, 
> 
> For the --jars parameter, just pass in the jars as comma-delimited. As for 
> the --driver-class-path, make it colon-delimited -- similar to how you set 
> multiple paths for an environment variable (e.g. --driver-class-path 
> /home/hduser/jars/jconn4.jar:/home/hduse/jars/ojdbc6.jar). But if you're 
> already passing the jar files via the --jars param, you shouldn't need to set 
> them in the driver-class-path though, since they should already be 
> automatically added to the classpath. 
> 
> HTH, 
> 
> Deng 
> 
> On Mon, Feb 15, 2016 at 7:35 AM, Mich Talebzadeh  wrote: 
> 
> Hi, 
> 
> Is there any way one can pass multiple --driver-class-path and multiple --jars 
> to spark-shell. 
> 
> For example, something as below with two jar file entries, for Oracle 
> (ojdbc6.jar) and Sybase IQ (jconn4.jar) 
> 
> spark-shell --master spark://50.140.197.217:7077 [1] --driver-class-path 
> /home/hduser/jars/jconn4.jar --driver-class-path /home/hduser/jars/ojdbc6.jar 
> --jars /home/hduser/jars/jconn4.jar --jars /home/hduser/jars/ojdbc6.jar 
> 
> This works for one jar file only, and you need to add the jar file to both 
> --driver-class-path and --jars. I have not managed to make it work for more than one 
> type of JAR file 
> 
> Thanks, 
> 
> Dr Mich Talebzadeh 
> 
> LinkedIn _ 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  [2]_ 
> 
> http://talebzadehmich.wordpress.com [3] 
> 
> NOTE: The information in this email is proprietary and confidential. This 
> message is for the designated recipient only, if you are not the intended 
> recipient, you should destroy it immediately. Any information in this message 
> shall not be understood as given or endorsed by Peridale Technology Ltd, its 
> subsidiaries or their employees, unless expressly so stated. It is the 
> responsibility of the recipient to ensure that this email is virus free, 
> therefore neither Peridale Technology Ltd, its subsidiaries nor their 
> employees accept any responsibility.

-- 

Dr Mich Talebzadeh

LinkedIn
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com

NOTE: The information in this email is proprietary and confidential.
This message is for the designated recipient only, if you are not the
intended recipient, you should destroy it immediately. Any information
in this message shall not be understood as given or endorsed by Cloud
Technology Partners Ltd, its subsidiaries or their employees, unless
expressly so stated. It is the responsibility of the recipient to ensure
that this email is virus free, therefore neither Cloud Technology
partners Ltd, its subsidiaries nor their employees accept any
responsibility.

 

Links:
--
[1] http://50.140.197.217:7077
[2]
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
[3] http://talebzadehmich.wordpress.com/


Re: Passing multiple jar files to spark-shell

2016-02-15 Thread Ted Yu
Mich:
You can pass jars for driver using:
  spark.driver.extraClassPath

Cheers
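
Putting the suggestions in this thread together, a hedged example of the invocation (using the same two jars quoted below; --jars takes a comma-delimited list, --driver-class-path a colon-delimited one):

spark-shell --master spark://50.140.197.217:7077 \
  --jars /home/hduser/jars/ojdbc6.jar,/home/hduser/jars/jconn4.jar \
  --driver-class-path /home/hduser/jars/ojdbc6.jar:/home/hduser/jars/jconn4.jar

The spark.driver.extraClassPath property mentioned above is the configuration equivalent of --driver-class-path, so the same colon-delimited value can go into conf/spark-defaults.conf instead.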

On Mon, Feb 15, 2016 at 1:05 AM, Mich Talebzadeh 
wrote:

> Thanks Deng. Unfortunately it seems that it looks for driver-class-path as
> well.
>
>
>
> For example with --jars alone I get
>
>
>
> spark-shell --master spark://50.140.197.217:7077 --jars
> /home/hduser/jars/ojdbc6.jar,/home/hduser/jars/jconn4.jar
>
>
>
>
>
> s: org.apache.spark.sql.DataFrame = [AMOUNT_SOLD: decimal(10,0), TIME_ID:
> timestamp, CHANNEL_ID: bigint]
>
> warning: there were 1 deprecation warning(s); re-run with -deprecation for
> details
>
> *java.sql.SQLException: No suitable driver found for
> jdbc:oracle:thin:@rhes564:1521:mydb*
>
>
>
>
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> NOTE: The information in this email is proprietary and confidential. This
> message is for the designated recipient only, if you are not the intended
> recipient, you should destroy it immediately. Any information in this
> message shall not be understood as given or endorsed by Peridale Technology
> Ltd, its subsidiaries or their employees, unless expressly so stated. It is
> the responsibility of the recipient to ensure that this email is virus
> free, therefore neither Peridale Technology Ltd, its subsidiaries nor their
> employees accept any responsibility.
>
>
>
>
>
> *From:* odeach...@gmail.com [mailto:odeach...@gmail.com] *On Behalf Of *Deng
> Ching-Mallete
> *Sent:* 15 February 2016 03:27
> *To:* Mich Talebzadeh 
> *Cc:* user 
> *Subject:* Re: Passing multiple jar files to spark-shell
>
>
>
> Hi Mich,
>
>
>
> For the --jars parameter, just pass in the jars as comma-delimited. As for
> the --driver-class-path, make it colon-delimited -- similar to how you set
> multiple paths for an environment variable (e.g. --driver-class-path
> /home/hduser/jars/jconn4.jar:/home/hduse/jars/ojdbc6.jar). But if you're
> already passing the jar files via the --jars param, you shouldn't need to
> set them in the driver-class-path though, since they should already be
> automatically added to the classpath.
>
>
>
> HTH,
>
> Deng
>
>
>
> On Mon, Feb 15, 2016 at 7:35 AM, Mich Talebzadeh 
> wrote:
>
> Hi,
>
>
>
> Is there any way one can pass multiple --driver-class-path and multiple
> --jars to spark-shell.
>
>
>
> For example, something as below with two jar file entries, for Oracle
> (ojdbc6.jar) and Sybase IQ (jconn4.jar)
>
>
>
> spark-shell --master spark://50.140.197.217:7077 --driver-class-path
> /home/hduser/jars/jconn4.jar  --driver-class-path
> /home/hduser/jars/ojdbc6.jar --jars /home/hduser/jars/jconn4.jar --jars
> /home/hduser/jars/ojdbc6.jar
>
>
>
>
>
> This works for one jar file only, and you need to add the jar file to both 
> --driver-class-path
> and --jars. I have not managed to make it work for more than one type of JAR file.
>
>
>
> Thanks,
>
>
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> NOTE: The information in this email is proprietary and confidential. This
> message is for the designated recipient only, if you are not the intended
> recipient, you should destroy it immediately. Any information in this
> message shall not be understood as given or endorsed by Peridale Technology
> Ltd, its subsidiaries or their employees, unless expressly so stated. It is
> the responsibility of the recipient to ensure that this email is virus
> free, therefore neither Peridale Technology Ltd, its subsidiaries nor their
> employees accept any responsibility.
>
>
>
>
>
>
>
>
>


Re: Spark DataFrameNaFunctions unrecognized

2016-02-15 Thread Ted Yu
fill() was introduced in 1.3.1

Can you show code snippet which reproduces the error ?

I tried the following using spark-shell on master branch:

scala> df.na.fill(0)
res0: org.apache.spark.sql.DataFrame = [col: int]

Cheers

On Mon, Feb 15, 2016 at 3:36 AM, satish chandra j 
wrote:

> Hi All,
> Currently I am using Spark 1.4.0 version, getting error when trying to use
> "fill" function which is one among DataFrameNaFunctions
>
> Snippet:
> df.na.fill(col: )
>
> Error:
> value na is not a member of org.apache.spark.sql.DataFrame
>
> As I need null values in column "col" of DataFrame "df" to be replaced
> with value "" as given in the above snippet.
>
> I understand, code does not require any additional packages to support
> DataFrameNaFunctions
>
> Please let me know if I am missing anything so that I can make these
> DataFrameNaFunctions working
>
> Regards,
> Satish Chandra J
>


Re: Running synchronized JRI code

2016-02-15 Thread Simon Hafner
2016-02-15 14:02 GMT+01:00 Sun, Rui :
> On computation, RRDD launches one R process for each partition, so there 
> won't be thread-safe issue
>
> Could you give more details on your new environment?

Running on EC2, I start the executors via

 /usr/bin/R CMD javareconf -e "/usr/lib/spark/sbin/start-master.sh"

I invoke R via roughly

object R {
  case class Element(value: Double)
  lazy val re = Option(REngine.getLastEngine()).getOrElse({
val eng = new JRI.JRIEngine()

eng.parseAndEval(scala.io.Source.fromInputStream(this.getClass().getClassLoader().getResourceAsStream("r/fit.R")).mkString)
eng
  })

  def fit(curve: Seq[Element]): Option[Fitting] = {
synchronized {
  val env = re.newEnvironment(null, false)
  re.assign("curve", new REXPDouble(curve.map(_.value).toArray), env)
  val df = re.parseAndEval("data.frame(curve=curve)", env, true)
  re.assign("df", df, env)
  val fitted = re.parseAndEval("fit(df)", env, true).asList
  if (fitted.keys == null) {
None
  } else {
val map = fitted.keys.map(key => (key,
fitted.at(key).asDouble)).toMap
Some(Fitting(map("values")))
  }
}
  }
}

where `fit` is wrapped in an UDAF.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Using SPARK packages in Spark Cluster

2016-02-15 Thread Gourav Sengupta
Hi Jorge/ All,

Please please please go through this link
http://spark.apache.org/docs/latest/spark-standalone.html.

The link tells you how to start a SPARK cluster in local mode. If you have
not started or worked with a SPARK cluster in local mode, kindly do not attempt
to answer this question.

My question is how to use packages like
https://github.com/databricks/spark-csv when I am using a SPARK cluster in local
mode.

Regards,
Gourav Sengupta



On Mon, Feb 15, 2016 at 1:55 PM, Jorge Machado  wrote:

> Hi Gourav,
>
> I did not understand your problem… the --packages command should not
> make any difference if you are running standalone or in YARN for example.
> Give us an example what packages are you trying to load, and what error
> are you getting…  If you want to use the libraries in spark-packages.org 
> without
> the --packages why do you not use maven ?
> Regards
>
>
> On 12/02/2016, at 13:22, Gourav Sengupta 
> wrote:
>
> Hi,
>
> I am creating sparkcontext in a SPARK standalone cluster as mentioned
> here: http://spark.apache.org/docs/latest/spark-standalone.html using the
> following code:
>
>
> --
> sc.stop()
> conf = SparkConf().set( 'spark.driver.allowMultipleContexts' , False) \
>   .setMaster("spark://hostname:7077") \
>   .set('spark.shuffle.service.enabled', True) \
>   .set('spark.dynamicAllocation.enabled','true') \
>   .set('spark.executor.memory','20g') \
>   .set('spark.driver.memory', '4g') \
>
> .set('spark.default.parallelism',(multiprocessing.cpu_count() -1 ))
> conf.getAll()
> sc = SparkContext(conf = conf)
>
> -(we should definitely be able to optimise the configuration but that
> is not the point here) ---
>
> I am not able to use packages, a list of which is mentioned here
> http://spark-packages.org, using this method.
>
> Whereas if I use the standard "pyspark --packages" option then the
> packages load just fine.
>
> I will be grateful if someone could kindly let me know how to load
> packages when starting a cluster as mentioned above.
>
>
> Regards,
> Gourav Sengupta
>
>
>


Re: Using SPARK packages in Spark Cluster

2016-02-15 Thread Jorge Machado
Hi Gourav, 

I did not understand your problem… the --packages command should not make 
any difference if you are running standalone or in YARN for example.  
Give us an example what packages are you trying to load, and what error are you 
getting…  If you want to use the libraries in spark-packages.org 
 without the --packages why do you not use maven ? 
Regards 


> On 12/02/2016, at 13:22, Gourav Sengupta  wrote:
> 
> Hi,
> 
> I am creating sparkcontext in a SPARK standalone cluster as mentioned here: 
> http://spark.apache.org/docs/latest/spark-standalone.html 
>  using the 
> following code:
> 
> --
> sc.stop()
> conf = SparkConf().set( 'spark.driver.allowMultipleContexts' , False) \
>   .setMaster("spark://hostname:7077") \
>   .set('spark.shuffle.service.enabled', True) \
>   .set('spark.dynamicAllocation.enabled','true') \
>   .set('spark.executor.memory','20g') \
>   .set('spark.driver.memory', '4g') \
>   
> .set('spark.default.parallelism',(multiprocessing.cpu_count() -1 ))
> conf.getAll()
> sc = SparkContext(conf = conf)
> 
> -(we should definitely be able to optimise the configuration but that is 
> not the point here) ---
> 
> I am not able to use packages, a list of which is mentioned here 
> http://spark-packages.org , using this method. 
> 
> Whereas if I use the standard "pyspark --packages" option then the packages 
> load just fine.
> 
> I will be grateful if someone could kindly let me know how to load packages 
> when starting a cluster as mentioned above.
> 
> 
> Regards,
> Gourav Sengupta



RE: Running synchronized JRI code

2016-02-15 Thread Sun, Rui
On computation, RRDD launches one R process for each partition, so there won't 
be thread-safe issue

Could you give more details on your new environment?

-Original Message-
From: Simon Hafner [mailto:reactorm...@gmail.com] 
Sent: Monday, February 15, 2016 7:31 PM
To: Sun, Rui 
Cc: user 
Subject: Re: Running synchronized JRI code

2016-02-15 4:35 GMT+01:00 Sun, Rui :
> Yes, JRI loads an R dynamic library into the executor JVM, which faces 
> thread-safe issue when there are multiple task threads within the executor.
>
> I am thinking if the demand like yours (calling R code in RDD 
> transformations) is much desired, we may consider refactoring RRDD for this 
> purpose, although it is currently intended for internal use by SparkR and not 
> a public API.
So the RRDDs don't have that thread safety issue? I'm currently creating a new 
environment for each call, but it still crashes.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Single context Spark from Python and Scala

2016-02-15 Thread Chandeep Singh
You could consider using Zeppelin - 
https://zeppelin.incubator.apache.org/docs/latest/interpreter/spark.html 


https://zeppelin.incubator.apache.org/ 

ZeppelinContext
Zeppelin automatically injects ZeppelinContext as variable 'z' in your 
scala/python environment. ZeppelinContext provides some additional functions 
and utility.



Object exchange

ZeppelinContext extends map and it's shared between scala, python environment. 
So you can put some object from scala and read it from python, vise versa.

Put object from scala

%spark
val myObject = ...
z.put("objName", myObject)
Get object from python

%python
myObject = z.get("objName")


> On Feb 15, 2016, at 12:10 PM, Leonid Blokhin  wrote:
> 
> Hello
> 
>  I want to work with a single Spark context from Python and Scala. Is it 
> possible?
> 
> Is it possible to do this between a started ./bin/pyspark and ./bin/spark-shell, as a 
> dramatic example?
> 
> 
> 
> Cheers,
> 
> Leonid
> 



Single context Spark from Python and Scala

2016-02-15 Thread Leonid Blokhin
Hello

 I want to work with a single Spark context from Python and Scala. Is it
possible?

Is it possible to do this between a started ./bin/pyspark and ./bin/spark-shell,
as a dramatic example?


Cheers,

Leonid


Re: Using SPARK packages in Spark Cluster

2016-02-15 Thread Gourav Sengupta
Hi,

I am grateful for everyone's response, but sadly no one here actually has
read the question before responding.

Has anyone yet tried starting a SPARK cluster as mentioned in the link in
my email?

:)

Regards,
Gourav

On Mon, Feb 15, 2016 at 11:16 AM, Jorge Machado  wrote:

> $SPARK_HOME/bin/spark-shell --packages com.databricks:spark-csv_2.10:1.3.0
>
>
>
> It will download everything for you and register into your  JVM.  If you
> want to use it in your Prod just package it with maven.
>
> On 15/02/2016, at 12:14, Gourav Sengupta 
> wrote:
>
> Hi,
>
> How to we include the following package:
> https://github.com/databricks/spark-csv while starting a SPARK standalone
> cluster as mentioned here:
> http://spark.apache.org/docs/latest/spark-standalone.html
>
>
>
> Thanks and Regards,
> Gourav Sengupta
>
> On Mon, Feb 15, 2016 at 10:32 AM, Ramanathan R 
> wrote:
>
>> Hi Gourav,
>>
>> If your question is how to distribute python package dependencies across
>> the Spark cluster programmatically? ...here is an example -
>>
>>  $ export
>> PYTHONPATH='path/to/thrift.zip:path/to/happybase.zip:path/to/your/py/application'
>>
>> And in code:
>>
>> sc.addPyFile('/path/to/thrift.zip')
>> sc.addPyFile('/path/to/happybase.zip')
>>
>> Regards,
>> Ram
>>
>>
>>
>> On 15 February 2016 at 15:16, Gourav Sengupta 
>> wrote:
>>
>>> Hi,
>>>
>>> So far no one is able to get my question at all. I know what it takes to
>>> load packages via SPARK shell or SPARK submit.
>>>
>>> How do I load packages when starting a SPARK cluster, as mentioned here
>>> http://spark.apache.org/docs/latest/spark-standalone.html ?
>>>
>>>
>>> Regards,
>>> Gourav Sengupta
>>>
>>>
>>>
>>>
>>> On Mon, Feb 15, 2016 at 3:25 AM, Divya Gehlot 
>>> wrote:
>>>
 with conf option

 spark-submit --conf 'key = value '

 Hope that helps you.

 On 15 February 2016 at 11:21, Divya Gehlot 
 wrote:

> Hi Gourav,
> you can use like below to load packages at the start of the spark
> shell.
>
> spark-shell  --packages com.databricks:spark-csv_2.10:1.1.0
>
> On 14 February 2016 at 03:34, Gourav Sengupta <
> gourav.sengu...@gmail.com> wrote:
>
>> Hi,
>>
>> I was interested in knowing how to load the packages into SPARK
>> cluster started locally. Can someone pass me on the links to set the conf
>> file so that the packages can be loaded?
>>
>> Regards,
>> Gourav
>>
>> On Fri, Feb 12, 2016 at 6:52 PM, Burak Yavuz 
>> wrote:
>>
>>> Hello Gourav,
>>>
>>> The packages need to be loaded BEFORE you start the JVM, therefore
>>> you won't be able to add packages dynamically in code. You should use 
>>> the
>>> --packages with pyspark before you start your application.
>>> One option is to add a `conf` that will load some packages if you
>>> are constantly going to use them.
>>>
>>> Best,
>>> Burak
>>>
>>>
>>>
>>> On Fri, Feb 12, 2016 at 4:22 AM, Gourav Sengupta <
>>> gourav.sengu...@gmail.com> wrote:
>>>
 Hi,

 I am creating sparkcontext in a SPARK standalone cluster as
 mentioned here:
 http://spark.apache.org/docs/latest/spark-standalone.html using
 the following code:


 --
 sc.stop()
 conf = SparkConf().set( 'spark.driver.allowMultipleContexts' ,
 False) \
   .setMaster("spark://hostname:7077") \
   .set('spark.shuffle.service.enabled', True) \
   .set('spark.dynamicAllocation.enabled','true') \
   .set('spark.executor.memory','20g') \
   .set('spark.driver.memory', '4g') \

 .set('spark.default.parallelism',(multiprocessing.cpu_count() -1 ))
 conf.getAll()
 sc = SparkContext(conf = conf)

 -(we should definitely be able to optimise the configuration
 but that is not the point here) ---

 I am not able to use packages, a list of which is mentioned here
 http://spark-packages.org, using this method.

 Where as if I use the standard "pyspark --packages" option then the
 packages load just fine.

 I will be grateful if someone could kindly let me know how to load
 packages when starting a cluster as mentioned above.


 Regards,
 Gourav Sengupta

>>>
>>>
>>
>

>>>
>>
>
>


Spark DataFrameNaFunctions unrecognized

2016-02-15 Thread satish chandra j
Hi All,
Currently I am using Spark 1.4.0 version, getting error when trying to use
"fill" function which is one among DataFrameNaFunctions

Snippet:
df.na.fill(col: )

Error:
value na is not a member of org.apache.spark.sql.DataFrame

As I need null values in column "col" of DataFrame "df" to be replaced with
value "" as given in the above snippet.

I understand, code does not require any additional packages to support
DataFrameNaFunctions

Please let me know if I am missing anything so that I can make these
DataFrameNaFunctions working

Regards,
Satish Chandra J


Re: Running synchronized JRI code

2016-02-15 Thread Simon Hafner
2016-02-15 4:35 GMT+01:00 Sun, Rui :
> Yes, JRI loads an R dynamic library into the executor JVM, which faces 
> thread-safe issue when there are multiple task threads within the executor.
>
> I am thinking if the demand like yours (calling R code in RDD 
> transformations) is much desired, we may consider refactoring RRDD for this 
> purpose, although it is currently intended for internal use by SparkR and not 
> a public API.
So the RRDDs don't have that thread safety issue? I'm currently
creating a new environment for each call, but it still crashes.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: How to add kafka streaming jars when initialising the sparkcontext in python

2016-02-15 Thread Jorge Machado
Hi David, 

Just package with Maven and deploy everything into one jar. You don't need to do 
it like this… Use Maven, for example. And check if your cluster already has 
these libraries loaded. If you are using CDH, for example, you can just import the 
classes because they are already on the classpath of your JVM. 
> On 15/02/2016, at 10:35, David Kennedy  wrote:
> 
> I have no problems when submitting the task using spark-submit.  The --jars 
> option with the list of jars required is successful and I see in the output 
> the jars being added:
> 
> 16/02/10 11:14:24 INFO spark.SparkContext: Added JAR 
> file:/usr/lib/spark/extras/lib/spark-streaming-kafka.jar at 
> http://192.168.10.4:33820/jars/spark-streaming-kafka.jar 
>  with timestamp 
> 1455102864058
> 16/02/10 11:14:24 INFO spark.SparkContext: Added JAR 
> file:/opt/kafka/libs/scala-library-2.10.1.jar at 
> http://192.168.10.4:33820/jars/scala-library-2.10.1.jar 
>  with timestamp 
> 1455102864077
> 16/02/10 11:14:24 INFO spark.SparkContext: Added JAR 
> file:/opt/kafka/libs/kafka_2.10-0.8.1.1.jar at 
> http://192.168.10.4:33820/jars/kafka_2.10-0.8.1.1.jar 
>  with timestamp 
> 1455102864085
> 16/02/10 11:14:24 INFO spark.SparkContext: Added JAR 
> file:/opt/kafka/libs/metrics-core-2.2.0.jar at 
> http://192.168.10.4:33820/jars/metrics-core-2.2.0.jar 
>  with timestamp 
> 1455102864086
> 16/02/10 11:14:24 INFO spark.SparkContext: Added JAR 
> file:/usr/share/java/mysql.jar at http://192.168.10.4:33820/jars/mysql.jar 
>  with timestamp 1455102864090
> 
> But when I try to programmatically create a context in python (I want to set 
> up some tests) I don't see this and I end up with class not found errors.
> 
> Trying to cover all bases even though I suspect that it's redundant when 
> running local I've tried:
> 
> spark_conf = SparkConf()
> spark_conf.setMaster('local[4]')
> spark_conf.set('spark.executor.extraLibraryPath',
>'/usr/lib/spark/extras/lib/spark-streaming-kafka.jar,'
>'/opt/kafka/libs/scala-library-2.10.1.jar,'
>'/opt/kafka/libs/kafka_2.10-0.8.1.1.jar,'
>'/opt/kafka/libs/metrics-core-2.2.0.jar,'
>'/usr/share/java/mysql.jar')
> spark_conf.set('spark.executor.extraClassPath',
>'/usr/lib/spark/extras/lib/spark-streaming-kafka.jar,'
>'/opt/kafka/libs/scala-library-2.10.1.jar,'
>'/opt/kafka/libs/kafka_2.10-0.8.1.1.jar,'
>'/opt/kafka/libs/metrics-core-2.2.0.jar,'
>'/usr/share/java/mysql.jar')
> spark_conf.set('spark.driver.extraClassPath',
>'/usr/lib/spark/extras/lib/spark-streaming-kafka.jar,'
>'/opt/kafka/libs/scala-library-2.10.1.jar,'
>'/opt/kafka/libs/kafka_2.10-0.8.1.1.jar,'
>'/opt/kafka/libs/metrics-core-2.2.0.jar,'
>'/usr/share/java/mysql.jar')
> spark_conf.set('spark.driver.extraLibraryPath',
>'/usr/lib/spark/extras/lib/spark-streaming-kafka.jar,'
>'/opt/kafka/libs/scala-library-2.10.1.jar,'
>'/opt/kafka/libs/kafka_2.10-0.8.1.1.jar,'
>'/opt/kafka/libs/metrics-core-2.2.0.jar,'
>'/usr/share/java/mysql.jar')
> self.spark_context = SparkContext(conf=spark_conf)
> But still I get the same failure to find the same class:
> 
> Py4JJavaError: An error occurred while calling o30.loadClass.
> : java.lang.ClassNotFoundException: 
> org.apache.spark.streaming.kafka.KafkaUtilsPythonHelper
> 
> The class is certainly in the spark_streaming_kafka.jar and is present in the 
> filesystem at that location.
> 
> I'm under the impression that were I using java I'd be able to use the 
> addJars method on the conf to take care of this but there doesn't appear to 
> be anything that corresponds for python.
> 
> Hacking about I found that adding:
> 
> 
> spark_conf.set('spark.jars',
>'/usr/lib/spark/extras/lib/spark-streaming-kafka.jar,'
>'/opt/kafka/libs/scala-library-2.10.1.jar,'
>'/opt/kafka/libs/kafka_2.10-0.8.1.1.jar,'
>'/opt/kafka/libs/metrics-core-2.2.0.jar,'
>'/usr/share/java/mysql.jar')
> got the logging to admit to adding the jars to the http server (just as for 
> the spark submit output above) but leaving the other config options in place 
> or removing them the class is still not found.
> 
> Is this not possible in python?
> 
> Incidentally, I have tried SPARK_CLASSPATH (getting the message that it's 
> deprecated and ignored anyway) and I cannot find anything else to try.
> 
> Can anybody help?
> 
> David K.
> 



Re: Using SPARK packages in Spark Cluster

2016-02-15 Thread Jorge Machado
$SPARK_HOME/bin/spark-shell --packages com.databricks:spark-csv_2.10:1.3.0


It will download everything for you and register it with your JVM. If you want 
to use it in production, just package it with Maven. 
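For the original question of making a package available without repeating --packages on every invocation, one option (assuming Spark 1.5 or later, where the property exists) is to put the Maven coordinates into conf/spark-defaults.conf on the machine that launches the driver:

spark.jars.packages  com.databricks:spark-csv_2.10:1.3.0

spark-shell, spark-submit and pyspark all read that file, so the dependency is resolved before the JVM starts, which is the constraint Burak points out further down in the thread.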

> On 15/02/2016, at 12:14, Gourav Sengupta  wrote:
> 
> Hi,
> 
> How to we include the following package: 
> https://github.com/databricks/spark-csv 
>  while starting a SPARK standalone 
> cluster as mentioned here: 
> http://spark.apache.org/docs/latest/spark-standalone.html 
> 
> 
> 
> 
> Thanks and Regards,
> Gourav Sengupta
> 
> On Mon, Feb 15, 2016 at 10:32 AM, Ramanathan R  > wrote:
> Hi Gourav, 
> 
> If your question is how to distribute python package dependencies across the 
> Spark cluster programmatically? ...here is an example - 
> 
>  $ export 
> PYTHONPATH='path/to/thrift.zip:path/to/happybase.zip:path/to/your/py/application'
> 
> And in code:
> 
> sc.addPyFile('/path/to/thrift.zip')
> sc.addPyFile('/path/to/happybase.zip')
> 
> Regards, 
> Ram
> 
> 
> 
> On 15 February 2016 at 15:16, Gourav Sengupta  > wrote:
> Hi,
> 
> So far no one is able to get my question at all. I know what it takes to load 
> packages via SPARK shell or SPARK submit. 
> 
> How do I load packages when starting a SPARK cluster, as mentioned here 
> http://spark.apache.org/docs/latest/spark-standalone.html 
>  ?
> 
> 
> Regards,
> Gourav Sengupta
> 
> 
> 
> 
> On Mon, Feb 15, 2016 at 3:25 AM, Divya Gehlot  > wrote:
> with conf option 
> 
> spark-submit --conf 'key = value '
> 
> Hope that helps you.
> 
> On 15 February 2016 at 11:21, Divya Gehlot  > wrote:
> Hi Gourav,
> you can use like below to load packages at the start of the spark shell.
> 
> spark-shell  --packages com.databricks:spark-csv_2.10:1.1.0   
> 
> On 14 February 2016 at 03:34, Gourav Sengupta  > wrote:
> Hi,
> 
> I was interested in knowing how to load the packages into SPARK cluster 
> started locally. Can someone pass me on the links to set the conf file so 
> that the packages can be loaded? 
> 
> Regards,
> Gourav
> 
> On Fri, Feb 12, 2016 at 6:52 PM, Burak Yavuz  > wrote:
> Hello Gourav,
> 
> The packages need to be loaded BEFORE you start the JVM, therefore you won't 
> be able to add packages dynamically in code. You should use the --packages 
> with pyspark before you start your application.
> One option is to add a `conf` that will load some packages if you are 
> constantly going to use them.
> 
> Best,
> Burak
> 
> 
> 
> On Fri, Feb 12, 2016 at 4:22 AM, Gourav Sengupta  > wrote:
> Hi,
> 
> I am creating sparkcontext in a SPARK standalone cluster as mentioned here: 
> http://spark.apache.org/docs/latest/spark-standalone.html 
>  using the 
> following code:
> 
> --
> sc.stop()
> conf = SparkConf().set( 'spark.driver.allowMultipleContexts' , False) \
>   .setMaster("spark://hostname:7077") \
>   .set('spark.shuffle.service.enabled', True) \
>   .set('spark.dynamicAllocation.enabled','true') \
>   .set('spark.executor.memory','20g') \
>   .set('spark.driver.memory', '4g') \
>   
> .set('spark.default.parallelism',(multiprocessing.cpu_count() -1 ))
> conf.getAll()
> sc = SparkContext(conf = conf)
> 
> -(we should definitely be able to optimise the configuration but that is 
> not the point here) ---
> 
> I am not able to use packages, a list of which is mentioned here 
> http://spark-packages.org , using this method. 
> 
> Where as if I use the standard "pyspark --packages" option then the packages 
> load just fine.
> 
> I will be grateful if someone could kindly let me know how to load packages 
> when starting a cluster as mentioned above.
> 
> 
> Regards,
> Gourav Sengupta
> 
> 
> 
> 
> 
> 
> 



Re: Using SPARK packages in Spark Cluster

2016-02-15 Thread Gourav Sengupta
Hi,

How to we include the following package:
https://github.com/databricks/spark-csv while starting a SPARK standalone
cluster as mentioned here:
http://spark.apache.org/docs/latest/spark-standalone.html



Thanks and Regards,
Gourav Sengupta

On Mon, Feb 15, 2016 at 10:32 AM, Ramanathan R 
wrote:

> Hi Gourav,
>
> If your question is how to distribute python package dependencies across
> the Spark cluster programmatically? ...here is an example -
>
>  $ export
> PYTHONPATH='path/to/thrift.zip:path/to/happybase.zip:path/to/your/py/application'
>
> And in code:
>
> sc.addPyFile('/path/to/thrift.zip')
> sc.addPyFile('/path/to/happybase.zip')
>
> Regards,
> Ram
>
>
>
> On 15 February 2016 at 15:16, Gourav Sengupta 
> wrote:
>
>> Hi,
>>
>> So far no one is able to get my question at all. I know what it takes to
>> load packages via SPARK shell or SPARK submit.
>>
>> How do I load packages when starting a SPARK cluster, as mentioned here
>> http://spark.apache.org/docs/latest/spark-standalone.html ?
>>
>>
>> Regards,
>> Gourav Sengupta
>>
>>
>>
>>
>> On Mon, Feb 15, 2016 at 3:25 AM, Divya Gehlot 
>> wrote:
>>
>>> with conf option
>>>
>>> spark-submit --conf 'key = value '
>>>
>>> Hope that helps you.
>>>
>>> On 15 February 2016 at 11:21, Divya Gehlot 
>>> wrote:
>>>
 Hi Gourav,
 you can use like below to load packages at the start of the spark shell.

 spark-shell  --packages com.databricks:spark-csv_2.10:1.1.0

 On 14 February 2016 at 03:34, Gourav Sengupta <
 gourav.sengu...@gmail.com> wrote:

> Hi,
>
> I was interested in knowing how to load the packages into SPARK
> cluster started locally. Can someone pass me on the links to set the conf
> file so that the packages can be loaded?
>
> Regards,
> Gourav
>
> On Fri, Feb 12, 2016 at 6:52 PM, Burak Yavuz  wrote:
>
>> Hello Gourav,
>>
>> The packages need to be loaded BEFORE you start the JVM, therefore
>> you won't be able to add packages dynamically in code. You should use the
>> --packages with pyspark before you start your application.
>> One option is to add a `conf` that will load some packages if you are
>> constantly going to use them.
>>
>> Best,
>> Burak
>>
>>
>>
>> On Fri, Feb 12, 2016 at 4:22 AM, Gourav Sengupta <
>> gourav.sengu...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I am creating sparkcontext in a SPARK standalone cluster as
>>> mentioned here:
>>> http://spark.apache.org/docs/latest/spark-standalone.html using the
>>> following code:
>>>
>>>
>>> --
>>> sc.stop()
>>> conf = SparkConf().set( 'spark.driver.allowMultipleContexts' ,
>>> False) \
>>>   .setMaster("spark://hostname:7077") \
>>>   .set('spark.shuffle.service.enabled', True) \
>>>   .set('spark.dynamicAllocation.enabled','true') \
>>>   .set('spark.executor.memory','20g') \
>>>   .set('spark.driver.memory', '4g') \
>>>
>>> .set('spark.default.parallelism',(multiprocessing.cpu_count() -1 ))
>>> conf.getAll()
>>> sc = SparkContext(conf = conf)
>>>
>>> -(we should definitely be able to optimise the configuration but
>>> that is not the point here) ---
>>>
>>> I am not able to use packages, a list of which is mentioned here
>>> http://spark-packages.org, using this method.
>>>
>>> Where as if I use the standard "pyspark --packages" option then the
>>> packages load just fine.
>>>
>>> I will be grateful if someone could kindly let me know how to load
>>> packages when starting a cluster as mentioned above.
>>>
>>>
>>> Regards,
>>> Gourav Sengupta
>>>
>>
>>
>

>>>
>>
>


Re: Using SPARK packages in Spark Cluster

2016-02-15 Thread Ramanathan R
Hi Gourav,

If your question is how to distribute python package dependencies across
the Spark cluster programmatically? ...here is an example -

 $ export
PYTHONPATH='path/to/thrift.zip:path/to/happybase.zip:path/to/your/py/application'

And in code:

sc.addPyFile('/path/to/thrift.zip')
sc.addPyFile('/path/to/happybase.zip')

Regards,
Ram



On 15 February 2016 at 15:16, Gourav Sengupta 
wrote:

> Hi,
>
> So far no one is able to get my question at all. I know what it takes to
> load packages via SPARK shell or SPARK submit.
>
> How do I load packages when starting a SPARK cluster, as mentioned here
> http://spark.apache.org/docs/latest/spark-standalone.html ?
>
>
> Regards,
> Gourav Sengupta
>
>
>
>
> On Mon, Feb 15, 2016 at 3:25 AM, Divya Gehlot 
> wrote:
>
>> with conf option
>>
>> spark-submit --conf 'key = value '
>>
>> Hope that helps you.
>>
>> On 15 February 2016 at 11:21, Divya Gehlot 
>> wrote:
>>
>>> Hi Gourav,
>>> you can use like below to load packages at the start of the spark shell.
>>>
>>> spark-shell  --packages com.databricks:spark-csv_2.10:1.1.0
>>>
>>> On 14 February 2016 at 03:34, Gourav Sengupta >> > wrote:
>>>
 Hi,

 I was interested in knowing how to load the packages into SPARK cluster
 started locally. Can someone pass me on the links to set the conf file so
 that the packages can be loaded?

 Regards,
 Gourav

 On Fri, Feb 12, 2016 at 6:52 PM, Burak Yavuz  wrote:

> Hello Gourav,
>
> The packages need to be loaded BEFORE you start the JVM, therefore you
> won't be able to add packages dynamically in code. You should use the
> --packages with pyspark before you start your application.
> One option is to add a `conf` that will load some packages if you are
> constantly going to use them.
>
> Best,
> Burak
>
>
>
> On Fri, Feb 12, 2016 at 4:22 AM, Gourav Sengupta <
> gourav.sengu...@gmail.com> wrote:
>
>> Hi,
>>
>> I am creating sparkcontext in a SPARK standalone cluster as
>> mentioned here:
>> http://spark.apache.org/docs/latest/spark-standalone.html using the
>> following code:
>>
>>
>> --
>> sc.stop()
>> conf = SparkConf().set( 'spark.driver.allowMultipleContexts' , False)
>> \
>>   .setMaster("spark://hostname:7077") \
>>   .set('spark.shuffle.service.enabled', True) \
>>   .set('spark.dynamicAllocation.enabled','true') \
>>   .set('spark.executor.memory','20g') \
>>   .set('spark.driver.memory', '4g') \
>>
>> .set('spark.default.parallelism',(multiprocessing.cpu_count() -1 ))
>> conf.getAll()
>> sc = SparkContext(conf = conf)
>>
>> -(we should definitely be able to optimise the configuration but
>> that is not the point here) ---
>>
>> I am not able to use packages, a list of which is mentioned here
>> http://spark-packages.org, using this method.
>>
>> Where as if I use the standard "pyspark --packages" option then the
>> packages load just fine.
>>
>> I will be grateful if someone could kindly let me know how to load
>> packages when starting a cluster as mentioned above.
>>
>>
>> Regards,
>> Gourav Sengupta
>>
>
>

>>>
>>
>


New line lost in streaming output file

2016-02-15 Thread Ashutosh Kumar
I am getting multiple empty files in the streaming output for each interval.
To avoid this I tried:

 kStream.foreachRDD(new VoidFunction2<JavaRDD<String>, Time>() {
     public void call(JavaRDD<String> rdd, Time time) throws Exception {
         if (!rdd.isEmpty()) {
             rdd.saveAsTextFile("filename_" + time.milliseconds() + ".csv");
         }
     }
 });
This prevents the empty files from being written. However, the output now has
the lines appended one after another with the newlines removed; all the lines
are merged. How do I retain the newlines?

Thanks
Ashutosh


Re: Using SPARK packages in Spark Cluster

2016-02-15 Thread Gourav Sengupta
Hi,

So far no one is able to get my question at all. I know what it takes to
load packages via SPARK shell or SPARK submit.

How do I load packages when starting a SPARK cluster, as mentioned here
http://spark.apache.org/docs/latest/spark-standalone.html ?


Regards,
Gourav Sengupta




On Mon, Feb 15, 2016 at 3:25 AM, Divya Gehlot 
wrote:

> with conf option
>
> spark-submit --conf 'key = value '
>
> Hope that helps you.
>
> On 15 February 2016 at 11:21, Divya Gehlot 
> wrote:
>
>> Hi Gourav,
>> you can use like below to load packages at the start of the spark shell.
>>
>> spark-shell  --packages com.databricks:spark-csv_2.10:1.1.0
>>
>> On 14 February 2016 at 03:34, Gourav Sengupta 
>> wrote:
>>
>>> Hi,
>>>
>>> I was interested in knowing how to load the packages into SPARK cluster
>>> started locally. Can someone pass me on the links to set the conf file so
>>> that the packages can be loaded?
>>>
>>> Regards,
>>> Gourav
>>>
>>> On Fri, Feb 12, 2016 at 6:52 PM, Burak Yavuz  wrote:
>>>
 Hello Gourav,

 The packages need to be loaded BEFORE you start the JVM, therefore you
 won't be able to add packages dynamically in code. You should use the
 --packages with pyspark before you start your application.
 One option is to add a `conf` that will load some packages if you are
 constantly going to use them.

 Best,
 Burak



 On Fri, Feb 12, 2016 at 4:22 AM, Gourav Sengupta <
 gourav.sengu...@gmail.com> wrote:

> Hi,
>
> I am creating sparkcontext in a SPARK standalone cluster as mentioned
> here: http://spark.apache.org/docs/latest/spark-standalone.html using
> the following code:
>
>
> --
> sc.stop()
> conf = SparkConf().set( 'spark.driver.allowMultipleContexts' , False) \
>   .setMaster("spark://hostname:7077") \
>   .set('spark.shuffle.service.enabled', True) \
>   .set('spark.dynamicAllocation.enabled','true') \
>   .set('spark.executor.memory','20g') \
>   .set('spark.driver.memory', '4g') \
>
> .set('spark.default.parallelism',(multiprocessing.cpu_count() -1 ))
> conf.getAll()
> sc = SparkContext(conf = conf)
>
> -(we should definitely be able to optimise the configuration but
> that is not the point here) ---
>
> I am not able to use packages, a list of which is mentioned here
> http://spark-packages.org, using this method.
>
> Where as if I use the standard "pyspark --packages" option then the
> packages load just fine.
>
> I will be grateful if someone could kindly let me know how to load
> packages when starting a cluster as mentioned above.
>
>
> Regards,
> Gourav Sengupta
>


>>>
>>
>


How to add kafka streaming jars when initialising the sparkcontext in python

2016-02-15 Thread David Kennedy
I have no problems when submitting the task using spark-submit.  The --jars
option with the list of jars required is successful and I see in the output
the jars being added:

16/02/10 11:14:24 INFO spark.SparkContext: Added JAR
file:/usr/lib/spark/extras/lib/spark-streaming-kafka.jar at
http://192.168.10.4:33820/jars/spark-streaming-kafka.jar with timestamp
1455102864058
16/02/10 11:14:24 INFO spark.SparkContext: Added JAR
file:/opt/kafka/libs/scala-library-2.10.1.jar at
http://192.168.10.4:33820/jars/scala-library-2.10.1.jar with timestamp
1455102864077
16/02/10 11:14:24 INFO spark.SparkContext: Added JAR
file:/opt/kafka/libs/kafka_2.10-0.8.1.1.jar at
http://192.168.10.4:33820/jars/kafka_2.10-0.8.1.1.jar with timestamp
1455102864085
16/02/10 11:14:24 INFO spark.SparkContext: Added JAR
file:/opt/kafka/libs/metrics-core-2.2.0.jar at
http://192.168.10.4:33820/jars/metrics-core-2.2.0.jar with timestamp
1455102864086
16/02/10 11:14:24 INFO spark.SparkContext: Added JAR
file:/usr/share/java/mysql.jar at http://192.168.10.4:33820/jars/mysql.jar
with timestamp 1455102864090

But when I try to programmatically create a context in python (I want to
set up some tests) I don't see this and I end up with class not found
errors.

Trying to cover all bases even though I suspect that it's redundant when
running local I've tried:

spark_conf = SparkConf()
spark_conf.setMaster('local[4]')
spark_conf.set('spark.executor.extraLibraryPath',
   '/usr/lib/spark/extras/lib/spark-streaming-kafka.jar,'
   '/opt/kafka/libs/scala-library-2.10.1.jar,'
   '/opt/kafka/libs/kafka_2.10-0.8.1.1.jar,'
   '/opt/kafka/libs/metrics-core-2.2.0.jar,'
   '/usr/share/java/mysql.jar')
spark_conf.set('spark.executor.extraClassPath',
   '/usr/lib/spark/extras/lib/spark-streaming-kafka.jar,'
   '/opt/kafka/libs/scala-library-2.10.1.jar,'
   '/opt/kafka/libs/kafka_2.10-0.8.1.1.jar,'
   '/opt/kafka/libs/metrics-core-2.2.0.jar,'
   '/usr/share/java/mysql.jar')
spark_conf.set('spark.driver.extraClassPath',
   '/usr/lib/spark/extras/lib/spark-streaming-kafka.jar,'
   '/opt/kafka/libs/scala-library-2.10.1.jar,'
   '/opt/kafka/libs/kafka_2.10-0.8.1.1.jar,'
   '/opt/kafka/libs/metrics-core-2.2.0.jar,'
   '/usr/share/java/mysql.jar')
spark_conf.set('spark.driver.extraLibraryPath',
   '/usr/lib/spark/extras/lib/spark-streaming-kafka.jar,'
   '/opt/kafka/libs/scala-library-2.10.1.jar,'
   '/opt/kafka/libs/kafka_2.10-0.8.1.1.jar,'
   '/opt/kafka/libs/metrics-core-2.2.0.jar,'
   '/usr/share/java/mysql.jar')
self.spark_context = SparkContext(conf=spark_conf)

But still I get the same failure to find the same class:

Py4JJavaError: An error occurred while calling o30.loadClass.
: java.lang.ClassNotFoundException:
org.apache.spark.streaming.kafka.KafkaUtilsPythonHelper

The class is certainly in the spark_streaming_kafka.jar and is present in
the filesystem at that location.

I'm under the impression that were I using java I'd be able to use the
addJars method on the conf to take care of this but there doesn't appear to
be anything that corresponds for python.

Hacking about I found that adding:


spark_conf.set('spark.jars',
   '/usr/lib/spark/extras/lib/spark-streaming-kafka.jar,'
   '/opt/kafka/libs/scala-library-2.10.1.jar,'
   '/opt/kafka/libs/kafka_2.10-0.8.1.1.jar,'
   '/opt/kafka/libs/metrics-core-2.2.0.jar,'
   '/usr/share/java/mysql.jar')

got the logging to admit to adding the jars to the http server (just as in
the spark-submit output above), but whether I leave the other config options
in place or remove them, the class is still not found.

Is this not possible in python?

Incidentally, I have tried SPARK_CLASSPATH (getting the message that it's
deprecated and ignored anyway) and I cannot find anything else to try.

Can anybody help?

David K.


Re: Need help :Does anybody has HDP cluster on EC2?

2016-02-15 Thread Chandeep Singh
You could also fire up a VNC session and access all internal pages from there.

> On Feb 15, 2016, at 9:19 AM, Divya Gehlot  wrote:
> 
> Hi Sabarish,
> Thanks alot for your help.
> I am able to view the logs now 
> 
> Thank you very much .
> 
> Cheers,
> Divya 
> 
> 
> On 15 February 2016 at 16:51, Sabarish Sasidharan 
> > 
> wrote:
> You can setup SSH tunneling.
> 
> http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-ssh-tunnel.html
>  
> 
> 
> Regards
> Sab
> 
> On Mon, Feb 15, 2016 at 1:55 PM, Divya Gehlot  > wrote:
> Hi,
> I have hadoop cluster set up in EC2.
> I am unable to view application logs in Web UI as its taking internal IP 
> Like below :
> http://ip-xxx-xx-xx-xxx.ap-southeast-1.compute.internal:8042 
> 
> 
> How can I change this to external one or redirecting to external ?
> Attached screenshots for better understanding of my issue.
> 
> Would really appreciate help.
> 
> 
> Thanks,
> Divya 
> 
> 
> 
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org 
> 
> For additional commands, e-mail: user-h...@spark.apache.org 
> 
> 
> 
> 
> -- 
> 
> Architect - Big Data
> Ph: +91 99805 99458 
> 
> Manthan Systems | Company of the year - Analytics (2014 Frost and Sullivan 
> India ICT)
> +++
> 



Re: Need help :Does anybody has HDP cluster on EC2?

2016-02-15 Thread Divya Gehlot
Hi Sabarish,
Thanks a lot for your help.
I am able to view the logs now

Thank you very much .

Cheers,
Divya


On 15 February 2016 at 16:51, Sabarish Sasidharan <
sabarish.sasidha...@manthan.com> wrote:

> You can setup SSH tunneling.
>
>
> http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-ssh-tunnel.html
>
> Regards
> Sab
>
> On Mon, Feb 15, 2016 at 1:55 PM, Divya Gehlot 
> wrote:
>
>> Hi,
>> I have hadoop cluster set up in EC2.
>> I am unable to view application logs in Web UI as its taking internal IP
>> Like below :
>> http://ip-xxx-xx-xx-xxx.ap-southeast-1.compute.internal:8042
>> 
>>
>> How can I change this to external one or redirecting to external ?
>> Attached screenshots for better understanding of my issue.
>>
>> Would really appreciate help.
>>
>>
>> Thanks,
>> Divya
>>
>>
>>
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>
>
>
> --
>
> Architect - Big Data
> Ph: +91 99805 99458
>
> Manthan Systems | *Company of the year - Analytics (2014 Frost and
> Sullivan India ICT)*
> +++
>


RE: Passing multiple jar files to spark-shell

2016-02-15 Thread Mich Talebzadeh
Thanks Deng. Unfortunately it seems that it still looks for --driver-class-path 
as well :(

 

For example, with --jars alone I get:

 

spark-shell --master spark://50.140.197.217:7077 --jars 
/home/hduser/jars/ojdbc6.jar,/home/hduser/jars/jconn4.jar

 

 

s: org.apache.spark.sql.DataFrame = [AMOUNT_SOLD: decimal(10,0), TIME_ID: 
timestamp, CHANNEL_ID: bigint]

warning: there were 1 deprecation warning(s); re-run with -deprecation for 
details

java.sql.SQLException: No suitable driver found for 
jdbc:oracle:thin:@rhes564:1521:mydb
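
A remedy that is often suggested for this particular "No suitable driver found" error (the "driver" option belongs to the Spark JDBC data source; the Oracle class name below is an assumption about what ojdbc6.jar provides) is to name the driver class explicitly in the load options, so that java.sql.DriverManager does not have to discover it by itself:

val c = sqlContext.load("jdbc",
  Map("url" -> "jdbc:oracle:thin:@rhes564:1521:mydb",
      "dbtable" -> "(SELECT to_char(CHANNEL_ID) AS CHANNEL_ID, CHANNEL_DESC FROM sh.channels)",
      "driver" -> "oracle.jdbc.OracleDriver",
      "user" -> "sh",
      "password" -> "xxx"))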

 

 

 

Dr Mich Talebzadeh

 

LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

 

http://talebzadehmich.wordpress.com  

 

NOTE: The information in this email is proprietary and confidential. This 
message is for the designated recipient only, if you are not the intended 
recipient, you should destroy it immediately. Any information in this message 
shall not be understood as given or endorsed by Peridale Technology Ltd, its 
subsidiaries or their employees, unless expressly so stated. It is the 
responsibility of the recipient to ensure that this email is virus free, 
therefore neither Peridale Technology Ltd, its subsidiaries nor their employees 
accept any responsibility.

 

 

From: odeach...@gmail.com [mailto:odeach...@gmail.com] On Behalf Of Deng 
Ching-Mallete
Sent: 15 February 2016 03:27
To: Mich Talebzadeh 
Cc: user 
Subject: Re: Passing multiple jar files to spark-shell

 

Hi Mich,

 

For the --jars parameter, just pass in the jars as comma-delimited. As for the 
--driver-class-path, make it colon-delimited -- similar to how you set multiple 
paths for an environment variable (e.g. --driver-class-path 
/home/hduser/jars/jconn4.jar:/home/hduse/jars/ojdbc6.jar). But if you're 
already passing the jar files via the --jars param, you shouldn't need to set 
them in the driver-class-path though, since they should already be 
automatically added to the classpath.

 

HTH,

Deng

 

On Mon, Feb 15, 2016 at 7:35 AM, Mich Talebzadeh  > wrote:

Hi,

 

Is there any way one can pass multiple --driver-class-path and multiple --jars 
entries to spark-shell?

 

For example, something as below with two jar file entries, for Oracle 
(ojdbc6.jar) and Sybase IQ (jconn4.jar):

 

spark-shell --master spark://50.140.197.217:7077   
--driver-class-path /home/hduser/jars/jconn4.jar  --driver-class-path 
/home/hduser/jars/ojdbc6.jar --jars /home/hduser/jars/jconn4.jar --jars 
/home/hduser/jars/ojdbc6.jar

 

 

This works for one jar file only, and you need to add the jar file to both 
--driver-class-path and --jars. I have not managed to make it work for more 
than one type of JAR file.

 

Thanks,

 

 

Dr Mich Talebzadeh

 

LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

 

http://talebzadehmich.wordpress.com  

 

NOTE: The information in this email is proprietary and confidential. This 
message is for the designated recipient only, if you are not the intended 
recipient, you should destroy it immediately. Any information in this message 
shall not be understood as given or endorsed by Peridale Technology Ltd, its 
subsidiaries or their employees, unless expressly so stated. It is the 
responsibility of the recipient to ensure that this email is virus free, 
therefore neither Peridale Technology Ltd, its subsidiaries nor their employees 
accept any responsibility.

 

 

 

 



Re: Need help :Does anybody has HDP cluster on EC2?

2016-02-15 Thread Akhil Das
According to the documentation, the hostname that you are seeing for those
properties is inherited from *yarn.nodemanager.hostname*. If your requirement
is just to see the logs, then you can SSH-tunnel to the machine
(ssh -L 8042:127.0.0.1:8042 mastermachine).

Thanks
Best Regards

On Mon, Feb 15, 2016 at 2:29 PM, Divya Gehlot 
wrote:

> Yes you are correct Akhil
> But I am unable to find that property
> Please find attached screen shots for more details.
>
> Thanks,
> Divya
>
>
>
> On 15 February 2016 at 16:37, Akhil Das 
> wrote:
>
>> You can set *yarn.nodemanager.webapp.address* in the
>> yarn-site.xml/yarn-default.xml file to change it I guess.
>>
>> Thanks
>> Best Regards
>>
>> On Mon, Feb 15, 2016 at 1:55 PM, Divya Gehlot 
>> wrote:
>>
>> > Hi,
>> > I have hadoop cluster set up in EC2.
>> > I am unable to view application logs in Web UI as its taking internal IP
>> > Like below :
>> > http://ip-xxx-xx-xx-xxx.ap-southeast-1.compute.internal:8042
>> > 
>> >
>> > How can I change this to external one or redirecting to external ?
>> > Attached screenshots for better understanding of my issue.
>> >
>> > Would really appreciate help.
>> >
>> >
>> > Thanks,
>> > Divya
>> >
>> >
>> >
>>
>
>


Re: Need help :Does anybody has HDP cluster on EC2?

2016-02-15 Thread Sabarish Sasidharan
You can setup SSH tunneling.

http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-ssh-tunnel.html

Regards
Sab

On Mon, Feb 15, 2016 at 1:55 PM, Divya Gehlot 
wrote:

> Hi,
> I have hadoop cluster set up in EC2.
> I am unable to view application logs in Web UI as its taking internal IP
> Like below :
> http://ip-xxx-xx-xx-xxx.ap-southeast-1.compute.internal:8042
> 
>
> How can I change this to external one or redirecting to external ?
> Attached screenshots for better understanding of my issue.
>
> Would really appreciate help.
>
>
> Thanks,
> Divya
>
>
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>



-- 

Architect - Big Data
Ph: +91 99805 99458

Manthan Systems | *Company of the year - Analytics (2014 Frost and Sullivan
India ICT)*
+++


RE: How to join an RDD with a hive table?

2016-02-15 Thread Mich Talebzadeh
It is also worthwhile using temporary tables for the join query.

 

I can join a Hive table with any other JDBC-accessed table from any other 
database using DataFrames and temporary tables. 

 

//

//Get the FACT table from Hive

//

var s = HiveContext.sql("SELECT AMOUNT_SOLD, TIME_ID, CHANNEL_ID FROM 
oraclehadoop.sales")

 

//

//Get the Dimension table from Oracle via JDBC

//

val c = HiveContext.load("jdbc",

Map("url" -> "jdbc:oracle:thin:@rhes564:1521:mydb",

"dbtable" -> "(SELECT to_char(CHANNEL_ID) AS CHANNEL_ID, CHANNEL_DESC FROM 
sh.channels)",

"user" -> "sh",

"password" -> "xxx"))

 

 

s.registerTempTable("t_s")

c.registerTempTable("t_c")

 

And do the join

 

var sqltext = """

SELECT rs.Month, rs.SalesChannel, round(TotalSales,2)

FROM

(

SELECT t_t.CALENDAR_MONTH_DESC AS Month, t_c.CHANNEL_DESC AS SalesChannel, 
SUM(t_s.AMOUNT_SOLD) AS TotalSales

FROM t_s, t_t, t_c

WHERE t_s.TIME_ID = t_t.TIME_ID

AND   t_s.CHANNEL_ID = t_c.CHANNEL_ID

GROUP BY t_t.CALENDAR_MONTH_DESC, t_c.CHANNEL_DESC

ORDER by t_t.CALENDAR_MONTH_DESC, t_c.CHANNEL_DESC

) rs

LIMIT 1000

"""

HiveContext.sql(sqltext).collect.foreach(println)

 

HTH

 

Dr Mich Talebzadeh

 

LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

 

http://talebzadehmich.wordpress.com  

 

NOTE: The information in this email is proprietary and confidential. This 
message is for the designated recipient only, if you are not the intended 
recipient, you should destroy it immediately. Any information in this message 
shall not be understood as given or endorsed by Peridale Technology Ltd, its 
subsidiaries or their employees, unless expressly so stated. It is the 
responsibility of the recipient to ensure that this email is virus free, 
therefore neither Peridale Technology Ltd, its subsidiaries nor their employees 
accept any responsibility.

 

 

From: Ted Yu [mailto:yuzhih...@gmail.com] 
Sent: 15 February 2016 08:44
To: SRK 
Cc: user 
Subject: Re: How to join an RDD with a hive table?

 

Have you tried creating a DataFrame from the RDD and join with DataFrame which 
corresponds to the hive table ?

 

On Sun, Feb 14, 2016 at 9:53 PM, SRK  > wrote:

Hi,

How to join an RDD with a hive table and retrieve only the records that I am
interested. Suppose, I have an RDD that has 1000 records and there is a Hive
table with 100,000 records, I should be able to join the RDD with the hive
table  by an Id and I should be able to load only those 1000 records from
Hive table so that there are no memory issues. Also, I was planning on storing the
data in hive in the form of parquet files. Any help on this is greatly
appreciated.

Thanks!



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-join-an-RDD-with-a-hive-table-tp26225.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org 
 
For additional commands, e-mail: user-h...@spark.apache.org 
 

 



Re: How to join an RDD with a hive table?

2016-02-15 Thread Ted Yu
Have you tried creating a DataFrame from the RDD and join with DataFrame
which corresponds to the hive table ?
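A minimal sketch of that idea (all of the names here are hypothetical: keysRDD stands for the 1000-record RDD of ids, and the table and column names are placeholders), written against a Spark 1.x HiveContext:

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
import hiveContext.implicits._

// keysRDD: RDD[Long] holding the ~1000 ids of interest
val keysDF = keysRDD.map(id => Tuple1(id)).toDF("id")

// The Hive table (Parquet-backed tables work the same way) exposed as a DataFrame
val hiveDF = hiveContext.table("big_table")

// The inner join keeps only the rows whose id appears in the RDD, so only the
// matching subset of the 100,000 records ends up in the result.
val joined = hiveDF.join(keysDF, hiveDF("id") === keysDF("id"))
joined.show()

Whether the larger side is actually pruned before the shuffle depends on the join strategy Spark picks, but either way nothing has to be collected to the driver.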

On Sun, Feb 14, 2016 at 9:53 PM, SRK  wrote:

> Hi,
>
> How to join an RDD with a hive table and retrieve only the records that I
> am
> interested. Suppose, I have an RDD that has 1000 records and there is a
> Hive
> table with 100,000 records, I should be able to join the RDD with the hive
> table  by an Id and I should be able to load only those 1000 records from
> Hive table so that there are no memory issues. Also, I was planning on storing
> the
> data in hive in the form of parquet files. Any help on this is greatly
> appreciated.
>
> Thanks!
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-join-an-RDD-with-a-hive-table-tp26225.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Need help :Does anybody has HDP cluster on EC2?

2016-02-15 Thread Akhil Das
You can set *yarn.nodemanager.webapp.address* in the
yarn-site.xml/yarn-default.xml file to change it I guess.

Thanks
Best Regards

On Mon, Feb 15, 2016 at 1:55 PM, Divya Gehlot 
wrote:

> Hi,
> I have hadoop cluster set up in EC2.
> I am unable to view application logs in Web UI as its taking internal IP
> Like below :
> http://ip-xxx-xx-xx-xxx.ap-southeast-1.compute.internal:8042
> 
>
> How can I change this to external one or redirecting to external ?
> Attached screenshots for better understanding of my issue.
>
> Would really appreciate help.
>
>
> Thanks,
> Divya
>
>
>


Re: Best way to bring up Spark with Cassandra (and Elasticsearch) in production.

2016-02-15 Thread Ted Yu
Sounds reasonable.

Please consider posting questions about the Spark C* connector on their
mailing list if you have any.

On Sun, Feb 14, 2016 at 7:51 PM, Kevin Burton  wrote:

> Afternoon.
>
> About 6 months ago I tried (and failed) to get Spark and Cassandra working
> together in production due to dependency hell.
>
> I'm going to give it another try!
>
> Here's my general strategy.
>
> I'm going to create a maven module for my code... with spark dependencies.
>
> Then I'm going to get that to run and have unit tests for reading from
> files and writing the data back out the way I want via spark jobs.
>
> Then I'm going to setup cassandra unit to embed cassandra in my project.
> Then I'm going to point Spark to Cassandra and have the same above code
> work with Cassandra but instead of reading from a file it reads/writes to
> C*.
>
> Then once testing is working I'm going to setup spark in cluster mode with
> the same dependencies.
>
> Does this sound like a reasonable strategy?
>
> Kevin
>
> --
>
> We’re hiring if you know of any awesome Java Devops or Linux Operations
> Engineers!
>
> Founder/CEO Spinn3r.com
> Location: *San Francisco, CA*
> blog: http://burtonator.wordpress.com
> … or check out my Google+ profile
> 
>
>


Re: Unable to insert overwrite table with Spark 1.5.2

2016-02-15 Thread Ted Yu
Do you mind trying Spark 1.6.0 ?

As far as I can tell, the 'Cannot overwrite table' exception may only occur
for CreateTableUsingAsSelect when the source and destination relations refer
to the same table in branch-1.6.
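
Until an upgrade is possible, one workaround that is commonly used for this check (a sketch only: the staging table name is made up and the filter stands in for the real joins) is to materialise the transformed result somewhere other than the source table, and only then overwrite the original:

import org.apache.spark.sql.SaveMode

val PRODUCT_TABLE = "product_dim"
val STAGING_TABLE = "product_dim_staging"   // hypothetical intermediate table

val productDimDF = hiveContext.table(PRODUCT_TABLE)
val transformed  = productDimDF.filter("eff_flag = 'Y'")   // placeholder for the real joins/filters

// 1) Write the result to a table that is not also being read from ...
transformed.write.mode(SaveMode.Overwrite).saveAsTable(STAGING_TABLE)

// 2) ... then read it back and overwrite the original table.
hiveContext.table(STAGING_TABLE)
  .write
  .mode(SaveMode.Overwrite)
  .insertInto(PRODUCT_TABLE)

Because step 2 reads from the staging table rather than from product_dim, the rule in PreWriteCheck no longer finds the destination among the sources.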

Cheers

On Sun, Feb 14, 2016 at 9:29 PM, Ramanathan R 
wrote:

> Hi All,
>
> Spark 1.5.2 does not seem to be backward compatible with functionality
> that was available in earlier versions, at least in 1.3.1 and 1.4.1. It is
> not possible to insert overwrite into an existing table that was read as a
> DataFrame initially.
>
> Our existing code base has few internal Hive tables being overwritten
> after some join operations.
>
> For e.g.
> val PRODUCT_TABLE = "product_dim"
> val productDimDF = hiveContext.table(PRODUCT_TABLE)
> // Joins, filters ...
> productDimDF.write.mode(SaveMode.Overwrite).insertInto(PRODUCT_TABLE)
>
> This results in the exception -
> org.apache.spark.sql.AnalysisException: Cannot overwrite table
> `product_dim` that is also being read from.;
> at
> org.apache.spark.sql.execution.datasources.PreWriteCheck.failAnalysis(rules.scala:82)
> at
> org.apache.spark.sql.execution.datasources.PreWriteCheck$$anonfun$apply$2.apply(rules.scala:155)
> at
> org.apache.spark.sql.execution.datasources.PreWriteCheck$$anonfun$apply$2.apply(rules.scala:85)
>
> Is there any configuration to disable this particular rule? Any pointers
> to solve this would be very helpful.
>
> Thanks,
> Ram
>


Need help :Does anybody has HDP cluster on EC2?

2016-02-15 Thread Divya Gehlot
Hi,
I have hadoop cluster set up in EC2.
I am unable to view application logs in the Web UI, as it is using the internal IP,
like below:
http://ip-xxx-xx-xx-xxx.ap-southeast-1.compute.internal:8042


How can I change this to the external one, or redirect to it?
Attached screenshots for better understanding of my issue.

Would really appreciate help.


Thanks,
Divya


Re: Scala types to StructType

2016-02-15 Thread Ted Yu
Please see the last line of convertToCatalyst(a: Any):

   case other => other

FYI

On Mon, Feb 15, 2016 at 12:09 AM, Fabian Böhnlein <
fabian.boehnl...@gmail.com> wrote:

> Interesting, thanks.
>
> The (only) publicly accessible method seems *convertToCatalyst*:
>
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/CatalystTypeConverters.scala#L425
>
> Seems it's missing some types like Integer, Short, Long... I'll give it a
> try.
>
> Thanks,
> Fabian
>
>
> On 12/02/16 05:53, Yogesh Mahajan wrote:
>
> Right, Thanks Ted.
>
> On Fri, Feb 12, 2016 at 10:21 AM, Ted Yu  wrote:
>
>> Minor correction: the class is CatalystTypeConverters.scala
>>
>> On Thu, Feb 11, 2016 at 8:46 PM, Yogesh Mahajan <
>> ymaha...@snappydata.io> wrote:
>>
>>> CatatlystTypeConverters.scala has all types of utility methods to
>>> convert from Scala to row and vice a versa.
>>>
>>>
>>> On Fri, Feb 12, 2016 at 12:21 AM, Rishabh Wadhawan <
>>> rishabh...@gmail.com> wrote:
>>>
 I had the same issue. I resolved it in Java, but I am pretty sure it
 would work with Scala too. It's kind of a gross hack. Say I had a table in
 MySQL with 1000 columns: what I did is throw a JDBC query to extract the
 schema of the table. I stored that schema and wrote a map function to create
 StructFields using StructType and RowFactory. Then I took that table loaded
 as a DataFrame, even though it had a schema, and converted that DataFrame
 into an RDD, which is when it lost the schema. Then I performed something
 using that RDD and converted the RDD back with the StructFields.
 If your source is a structured type then it would be better if you can
 load it directly as a DataFrame; that way you can preserve the schema.
 However, in your case you should do something like this:
 List<StructField> fields = new ArrayList<StructField>();
 for (String key : MAP.keySet())
  fields.add(DataTypes.createStructField(key, DataTypes.StringType,
 true));

 StructType schemaOfDataFrame = DataTypes.createStructType(fields);

 sqlcontext.createDataFrame(rdd, schemaOfDataFrame);

 This is how I would do it to make it in Java, not sure about scala
 syntax. Please tell me if that helped.

 On Feb 11, 2016, at 7:20 AM, Fabian Böhnlein <
 fabian.boehnl...@gmail.com> wrote:

 Hi all,

 is there a way to create a Spark SQL Row schema based on Scala data
 types without creating a manual mapping?

 That's the only example I can find which doesn't require
 spark.sql.types.DataType already as input, but it requires to define them
 as Strings.

 val struct = (new StructType)
   .add("a", "int")
   .add("b", "long")
   .add("c", "string")



 Specifically I have an RDD where each element is a Map of 100s of
 variables with different data types which I want to transform to a 
 DataFrame
 where the keys should end up as the column names:

 Map ("Amean" -> 20.3, "Asize" -> 12, "Bmean" -> )


 Is there a different possibility than building a mapping from the
 values' .getClass to the Spark SQL DataTypes?


 Thanks,
 Fabian





>>>
>>
>
>


Re: Scala types to StructType

2016-02-15 Thread Fabian Böhnlein

Interesting, thanks.

The (only) publicly accessible method seems to be /convertToCatalyst/:
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/CatalystTypeConverters.scala#L425

Seems it's missing some types like Integer, Short, Long... I'll give it 
a try.
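
In case convertToCatalyst does not cover it, the getClass-style mapping is 
small enough to sketch (the list of types is deliberately incomplete, the 
StringType fallback is an assumption, and mapRDD is a stand-in for the RDD 
of maps):

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

def dataTypeFor(v: Any): DataType = v match {
  case _: java.lang.Integer => IntegerType
  case _: java.lang.Long    => LongType
  case _: java.lang.Short   => ShortType
  case _: java.lang.Double  => DoubleType
  case _: java.lang.Boolean => BooleanType
  case _: String            => StringType
  case _                    => StringType   // fallback for anything not listed
}

// mapRDD: RDD[Map[String, Any]] with the ~100 variables per element
val sample     = mapRDD.first()
val fieldNames = sample.keys.toSeq.sorted
val schema     = StructType(fieldNames.map(n => StructField(n, dataTypeFor(sample(n)), nullable = true)))

val rows = mapRDD.map(m => Row.fromSeq(fieldNames.map(n => m.getOrElse(n, null))))
val df   = sqlContext.createDataFrame(rows, schema)

The obvious caveat is that the schema is inferred from a single sample element, 
so every map has to use the same type for the same key.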


Thanks,
Fabian

On 12/02/16 05:53, Yogesh Mahajan wrote:

Right, Thanks Ted.

On Fri, Feb 12, 2016 at 10:21 AM, Ted Yu > wrote:


Minor correction: the class is CatalystTypeConverters.scala

On Thu, Feb 11, 2016 at 8:46 PM, Yogesh Mahajan
> wrote:

CatatlystTypeConverters.scala has all types of utility methods
to convert from Scala to row and vice a versa.


On Fri, Feb 12, 2016 at 12:21 AM, Rishabh Wadhawan
> wrote:

I had the same issue. I resolved it in Java, but I am
pretty sure it would work with scala too. Its kind of a
gross hack. But what I did is say I had a table in Mysql
with 1000 columns
what is did is that I threw a jdbc query to extracted the
schema of the table. I stored that schema and wrote a map
function to create StructFields using structType and
Row.Factory. Then I took that table loaded as a dataFrame,
event though it had a schema. I converted that data frame
into an RDD, this is when it lost the schema. Then
performed something using that RDD and then converted back
that RDD with the structfield.
If your source is structured type then it would be better
if you can load it directly as a DF that way you can
preserve the schema. However, in your case you should do
something like this
List fields = new ArrayList
for(keys in MAP)
 fields.add(DataTypes.createStructField(keys,
DataTypes.StringType, true));

StrructType schemaOfDataFrame =
DataTypes.createStructType(conffields);

sqlcontext.createDataFrame(rdd, schemaOfDataFrame);

This is how I would do it to make it in Java, not sure
about scala syntax. Please tell me if that helped.

On Feb 11, 2016, at 7:20 AM, Fabian Böhnlein
> wrote:

Hi all,

is there a way to create a Spark SQL Row schema based on
Scala data types without creating a manual mapping?

That's the only example I can find which doesn't require
spark.sql.types.DataType already as input, but it
requires to define them as Strings.

val struct = (new StructType)
  .add("a", "int")
  .add("b", "long")
  .add("c", "string")


Specifically I have an RDD where each element is a Map of
100s of variables with different data types which I want
to transform to a DataFrame
where the keys should end up as the column names:
Map ("Amean" -> 20.3, "Asize" -> 12, "Bmean" -> )

Is there a different possibility than building a mapping
from the values' .getClass to the Spark SQL DataTypes?


Thanks,
Fabian










Re: mllib:Survival Analysis : assertion failed: AFTAggregator loss sum is infinity. Error for unknown reason.

2016-02-15 Thread Yanbo Liang
Hi Stuti,

This is a bug in AFTSurvivalRegression; we did not handle "lossSum ==
infinity" properly.
I have opened https://issues.apache.org/jira/browse/SPARK-13322 to track this
issue and will send a PR.
Thanks for reporting this issue.

Yanbo

2016-02-12 15:03 GMT+08:00 Stuti Awasthi :

> Hi All,
>
> I wanted to try Survival Analysis on Spark 1.6. I am successfully able to
> run the AFT example provided. Now I tried to train the model with the Ovarian
> data, which is a standard dataset that comes with the Survival library in R.
>
> Default Column Name :  *Futime,fustat,age,resid_ds,rx,ecog_ps*
>
>
>
> Here are the steps I have done :
>
> · Loaded the data from csv to dataframe labeled as
>
> *val* ovarian_data = sqlContext.read
>
>   .format("com.databricks.spark.csv")
>
>   .option("header", "true") // Use first line of all files as header
>
>   .option("inferSchema", "true") // Automatically infer data types
>
>   .load("Ovarian.csv").toDF("label", "censor", "age", "resid_ds", "rx",
> "ecog_ps")
>
> · Utilize the VectorAssembler() to create features from "age",
> "resid_ds", "rx", "ecog_ps" like
>
> *val* assembler = *new* VectorAssembler()
>
> .setInputCols(Array("age", "resid_ds", "rx", "ecog_ps"))
>
> .setOutputCol("features")
>
>
>
> · Then I create a new dataframe with only 3 colums as :
>
> *val* training = finalDf.select("label", "censor", "features")
>
>
>
> · Finally Im passing it to AFT
>
> *val* model = aft.fit(training)
>
>
>
> Im getting the error as :
>
> java.lang.AssertionError: *assertion failed: AFTAggregator loss sum is
> infinity. Error for unknown reason.*
>
>at scala.Predef$.assert(*Predef.scala:179*)
>
>at org.apache.spark.ml.regression.AFTAggregator.add(
> *AFTSurvivalRegression.scala:480*)
>
>at org.apache.spark.ml.regression.AFTCostFun$$anonfun$5.apply(
> *AFTSurvivalRegression.scala:522*)
>
>at org.apache.spark.ml.regression.AFTCostFun$$anonfun$5.apply(
> *AFTSurvivalRegression.scala:521*)
>
>at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(
> *TraversableOnce.scala:144*)
>
>at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(
> *TraversableOnce.scala:144*)
>
>at scala.collection.Iterator$class.foreach(*Iterator.scala:727*)
>
>
>
> I have tried to print the schema :
>
> ()root
>
> |-- label: double (nullable = true)
>
> |-- censor: double (nullable = true)
>
> |-- features: vector (nullable = true)
>
>
>
> Sample data training looks like
>
> [59.0,1.0,[72.3315,2.0,1.0,1.0]]
>
> [115.0,1.0,[74.4932,2.0,1.0,1.0]]
>
> [156.0,1.0,[66.4658,2.0,1.0,2.0]]
>
> [421.0,0.0,[53.3644,2.0,2.0,1.0]]
>
> [431.0,1.0,[50.3397,2.0,1.0,1.0]]
>
>
>
> I am not able to understand the error, because if I use the same data and
> create the denseVector as given in the AFT sample example, then the code works
> completely fine. But I would like to read the data from a CSV file and then
> proceed.
>
>
>
> Please suggest
>
>
>
> Thanks 
>
> Stuti Awasthi
>
>
>
>
>
> ::DISCLAIMER::
>
> 
>
> The contents of this e-mail and any attachment(s) are confidential and
> intended for the named recipient(s) only.
> E-mail transmission is not guaranteed to be secure or error-free as
> information could be intercepted, corrupted,
> lost, destroyed, arrive late or incomplete, or may contain viruses in
> transmission. The e mail and its contents
> (with or without referred errors) shall therefore not attach any liability
> on the originator or HCL or its affiliates.
> Views or opinions, if any, presented in this email are solely those of the
> author and may not necessarily reflect the
> views or opinions of HCL or its affiliates. Any form of reproduction,
> dissemination, copying, disclosure, modification,
> distribution and / or publication of this message without the prior
> written consent of authorized representative of
> HCL is strictly prohibited. If you have received this email in error
> please delete it and notify the sender immediately.
> Before opening any email and/or attachments, please check them for viruses
> and other defects.
>
>
> 
>