Re: Getting the size of a broadcast variable

2016-02-02 Thread Ted Yu
There is a chance that the log message may change in future releases.

Log snooping would be broken.

FYI

On Mon, Feb 1, 2016 at 9:55 PM, Takeshi Yamamuro 
wrote:

> Hi,
>
> Currently, there is no way to check the size except by snooping the INFO logs
> in the driver;
>
> 16/02/02 14:51:53 INFO BlockManagerInfo: Added rdd_2_12 in memory on
> localhost:58536 (size: 40.0 B, free: 510.7 MB)
>
>
>
> On Tue, Feb 2, 2016 at 8:20 AM, apu mishra . rr 
> wrote:
>
>> How can I determine the size (in bytes) of a broadcast variable? Do I
>> need to use the .dump method and then look at the size of the result, or is
>> there an easier way?
>>
>> Using PySpark with Spark 1.6.
>>
>> Thanks!
>>
>> Apu
>>
>
>
>
> --
> ---
> Takeshi Yamamuro
>
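In case it helps, a rough alternative to log snooping is to estimate the footprint of the value itself. A minimal Scala sketch (the sample map is only for illustration, and SizeEstimator gives an approximation of the deserialized size, not the serialized broadcast blocks):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.util.SizeEstimator

object BroadcastSize {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("broadcast-size").setMaster("local[*]"))

    val lookup = (1 to 1000000).map(i => i -> s"value-$i").toMap
    val b = sc.broadcast(lookup)

    // Approximate in-memory size of the broadcast value on the driver, in bytes.
    println(s"estimated broadcast size: ${SizeEstimator.estimate(b.value)} bytes")

    sc.stop()
  }
}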


RE: can we do column bind of 2 dataframes in spark R? similar to cbind in R?

2016-02-02 Thread Sun, Rui
Devesh,

A cbind-like operation is not supported by the Scala DataFrame API, so it is also
not supported in SparkR.

You may try to work around this using the approach in
http://stackoverflow.com/questions/32882529/how-to-zip-twoor-more-dataframe-in-spark

You could also submit a JIRA requesting such a feature from the Spark community.
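For reference, a hedged Scala sketch of that zip-based workaround: give each row of both DataFrames a positional index, join on it, and merge the columns. It assumes both DataFrames have the same number of rows and a meaningful ordering; the function and column names are made up for illustration.

import org.apache.spark.sql.{DataFrame, Row, SQLContext}
import org.apache.spark.sql.types.StructType

def cbind(sqlContext: SQLContext, left: DataFrame, right: DataFrame): DataFrame = {
  val leftIndexed  = left.rdd.zipWithIndex().map { case (row, i) => (i, row) }
  val rightIndexed = right.rdd.zipWithIndex().map { case (row, i) => (i, row) }

  // Join on the positional index and concatenate the row values.
  val joined = leftIndexed.join(rightIndexed).values
    .map { case (l, r) => Row.fromSeq(l.toSeq ++ r.toSeq) }

  val schema = StructType(left.schema.fields ++ right.schema.fields)
  sqlContext.createDataFrame(joined, schema)
}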

From: Devesh Raj Singh [mailto:raj.deves...@gmail.com]
Sent: Tuesday, February 2, 2016 2:08 PM
To: user@spark.apache.org
Subject: can we do column bind of 2 dataframes in spark R? similar to cbind in 
R?

Hi,

I want to merge 2 dataframes in SparkR column-wise, similar to cbind in R. We
have "unionAll" for rbind, but I could not find anything for cbind in SparkR.

--
Warm regards,
Devesh.


Re: Guidelines for writing SPARK packages

2016-02-02 Thread Praveen Devarao
Thanks David.

I am looking at extending the SparkSQL library with a custom
package...hence I was looking for details on any specific classes
to be extended or interfaces to be implemented to achieve the redirect of calls to my
module (when using .format).

If you have any info on these lines do share it with me...else debugging
through the code would be the way :-)

Thanking You

Praveen Devarao



From:   David Russell 
To: Praveen Devarao/India/IBM@IBMIN
Cc: user 
Date:   01/02/2016 07:03 pm
Subject:Re: Guidelines for writing SPARK packages
Sent by:marchoffo...@gmail.com



Hi Praveen,

The basic requirements for releasing a Spark package on
spark-packages.org are as follows:

1. The package content must be hosted by GitHub in a public repo under
the owner's account.
2. The repo name must match the package name.
3. The master branch of the repo must contain "README.md" and "LICENSE".

Per the doc on the spark-packages.org site, an example package that meets
those requirements can be found at
https://github.com/databricks/spark-avro. My own recently released
SAMBA package also meets these requirements:
https://github.com/onetapbeyond/lambda-spark-executor.

As you can see there is nothing in this list of requirements that
demands the implementation of specific interfaces. What you'll need to
implement will depend entirely on what you want to accomplish. If you
want to register a release for your package you will also need to push
the artifacts for your package to Maven central.

David
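On the specific classes question: in the Spark 1.x Data Sources API, a .format("com.example.mysource") call is resolved to a class named DefaultSource in that package, typically implementing RelationProvider (plus a BaseRelation with TableScan for reads). A minimal hedged sketch, with an assumed package name and dummy data rather than anything from this thread:

package com.example.mysource

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, RelationProvider, TableScan}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Resolved when callers use sqlContext.read.format("com.example.mysource").load()
class DefaultSource extends RelationProvider {
  override def createRelation(
      sqlContext: SQLContext,
      parameters: Map[String, String]): BaseRelation =
    new DummyRelation(sqlContext)
}

// A relation only has to expose a schema and a way to build an RDD[Row] for a scan.
class DummyRelation(val sqlContext: SQLContext) extends BaseRelation with TableScan {
  override def schema: StructType = StructType(Seq(StructField("value", StringType)))
  override def buildScan(): RDD[Row] =
    sqlContext.sparkContext.parallelize(Seq(Row("hello"), Row("world")))
}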


On Mon, Feb 1, 2016 at 7:03 AM, Praveen Devarao  
wrote:
> Hi,
>
> Is there any guideline or spec for writing a Spark package? I would
> like to implement a Spark package and would like to know how it needs to
> be structured (implement some interfaces, etc.) so that it can plug into
> Spark for extended functionality.
>
> Could anyone point me to docs or links on the above?
>
> Thanking You
>
> Praveen Devarao



-- 
"All that is gold does not glitter, Not all those who wander are lost."



Spark Streaming: Could not compute split

2016-02-02 Thread aafri
Hi,


We run Spark Streaming on YARN, and the Streaming Driver restarts very often.
I don't know what the problem is.
The exception is below:


16/02/01 18:55:14 ERROR scheduler.JobScheduler: Error running job streaming job 
1454324113000 ms.0
org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in 
stage 11032.0 failed 4 times, most recent failure: Lost task 2.3 in stage 
11032.0 (TID 16137, hadoop225.localdomain): java.lang.Exception: Could not 
compute split, block input-0-1454324106400 not found
at org.apache.spark.rdd.BlockRDD.compute(BlockRDD.scala:51)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:262)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:262)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)


Driver stacktrace:
at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1294)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1282)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1281)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at 
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1281)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
at scala.Option.foreach(Option.scala:236)
at 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:697)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1507)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1469)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1458)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at 
org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:567)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1824)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1837)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1850)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1921)
at 
org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:898)
at 
org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:896)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:306)
at org.apache.spark.rdd.RDD.foreachPartition(RDD.scala:896)
at 
com.in.sevenseas.onedegree.OneDegreeCompute_Repair$$anonfun$addCompute$1.apply(OneDegreeCompute_Repair.scala:98)
at 
com.in.sevenseas.onedegree.OneDegreeCompute_Repair$$anonfun$addCompute$1.apply(OneDegreeCompute_Repair.scala:95)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1$$anonfun$apply$mcV$sp$3.apply(DStream.scala:631)
at 
org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1$$anonfun$apply$mcV$sp$3.apply(DStream.scala:631)
at 
org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ForEachDStream.scala:42)
at 
org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.sca

Master failover results in running job marked as "WAITING"

2016-02-02 Thread Anthony Tang
Hi - 

I'm running Spark 1.5.2 in standalone mode with multiple masters using 
zookeeper for failover.  The master fails over correctly to the standby when it 
goes down, and running applications seem to continue to run, but in the new 
active master web UI, they are marked as "WAITING", and the workers have these 
entries in their logs: 

16/01/30 00:51:13 ERROR Worker: Connection to master failed! Waiting for master 
to reconnect... 
16/01/30 00:51:13 WARN Worker: Failed to connect to master XXX:7077 
akka.actor.ActorNotFound: Actor not found for: 
ActorSelection[Anchor(akka.tcp://sparkMaster@XXX:7077/), Path(/user/Master)] 

Should they be "RUNNING" still? One time, it looked like the job stopped 
functioning (This is a continuously running streaming job), but I haven't been 
able to reproduce it.  FWIW, the driver that started it is still marked as 
"RUNNING". 

Thanks. 
- Anthony


Spark saveAsHadoopFile stage fails with ExecutorLostFailure

2016-02-02 Thread Prabhu Joseph
Hi All,

A Spark job stage having saveAsHadoopFile fails with ExecutorLostFailure
whenever the executors are run with more cores. The stage is not memory
intensive, and each executor has 20 GB of memory. For example:

6 executors each with 6 cores, ExecutorLostFailure happens

10 executors each with 2 cores, saveAsHadoopFile runs fine.

What could be the reason for the ExecutorLostFailure when the number of cores per
executor is high?



Error: ExecutorLostFailure (executor 3 lost)

16/02/02 04:22:40 WARN TaskSetManager: Lost task 1.3 in stage 15.0 (TID
1318, hdnprd-c01-r01-14):



Thanks,
Prabhu Joseph


Re: Spark Streaming with Kafka - batch DStreams in memory

2016-02-02 Thread Cody Koeninger
It's possible you could (ab)use updateStateByKey or mapWithState for this.

But honestly it's probably a lot more straightforward to just choose a
reasonable batch size that gets you a reasonable file size for most of your
keys, then use filecrush or something similar to deal with the hdfs small
file problem.
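To illustrate the (ab)use mentioned above, a rough Scala sketch that buffers records per key with updateStateByKey until an assumed 64 MB threshold. The Kafka settings, key extraction, and flush step are placeholders, and resetting the state after a flush is only hinted at, not implemented:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object KeyedBuffering {
  // State per key: buffered lines plus their accumulated size in bytes.
  case class Buffer(lines: Vector[String], bytes: Long)

  val flushBytes = 64L * 1024 * 1024  // assumed target close to the HDFS block size

  def update(newLines: Seq[String], state: Option[Buffer]): Option[Buffer] = {
    val prev = state.getOrElse(Buffer(Vector.empty, 0L))
    val added = newLines.map(l => l.getBytes("UTF-8").length.toLong).sum
    Some(Buffer(prev.lines ++ newLines, prev.bytes + added))
  }

  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("keyed-buffering"), Seconds(30))
    ssc.checkpoint("/tmp/keyed-buffering-checkpoint")  // required by updateStateByKey

    // (key, line) pairs derived from Kafka messages; broker/topic settings are placeholders.
    val messages = KafkaUtils.createStream(ssc, "zk:2181", "group", Map("topic" -> 1))
      .map { case (_, value) => (value.takeWhile(_ != ','), value) }

    val buffered = messages.updateStateByKey(update _)

    // Keys whose buffers crossed the threshold; a real job would write these out here
    // and reset their state (for example by tracking a "flushed" flag, or by switching
    // to mapWithState, which also supports timeouts).
    buffered.filter { case (_, b) => b.bytes >= flushBytes }
      .foreachRDD { rdd =>
        rdd.foreach { case (key, b) => println(s"$key ready to flush: ${b.bytes} bytes") }
      }

    ssc.start()
    ssc.awaitTermination()
  }
}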

On Mon, Feb 1, 2016 at 10:11 PM, p pathiyil  wrote:

> Hi,
>
> Are there any ways to store DStreams / RDD read from Kafka in memory to be
> processed at a later time ? What we need to do is to read data from Kafka,
> process it to be keyed by some attribute that is present in the Kafka
> messages, and write out the data related to each key when we have
> accumulated enough data for that key to write out a file that is close to
> the HDFS block size, say 64MB. We are looking at ways to avoid writing out
> some file of the entire Kafka content periodically and then later run a
> second job to read those files and split them out to another set of files
> as necessary.
>
> Thanks.
>


RE: try to read multiple bz2 files in s3

2016-02-02 Thread Lin, Hao
Hi Robert,

I just use textFile. Here is the simple code:

val fs3File=sc.textFile("s3n://my bucket/myfolder/")
fs3File.count

do you suggest I should use sc.parallelize?

many thanks

From: Robert Collich [mailto:rcoll...@gmail.com]
Sent: Monday, February 01, 2016 6:54 PM
To: Lin, Hao; user
Subject: Re: try to read multiple bz2 files in s3

Hi Hao,

Could you please post the corresponding code? Are you using textFile or 
sc.parallelize?

On Mon, Feb 1, 2016 at 2:36 PM Lin, Hao 
mailto:hao@finra.org>> wrote:
When I tried to read multiple bz2 files from S3, I got the following warning
messages. What is the problem here?

16/02/01 22:30:30 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, 
10.162.67.248): java.lang.ArrayIndexOutOfBoundsException: -1844424343
at 
org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.getAndMoveToFrontDecode0(CBZip2InputStream.java:1014)
at 
org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.getAndMoveToFrontDecode(CBZip2InputStream.java:829)
at 
org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.initBlock(CBZip2InputStream.java:504)
at 
org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.changeStateToProcessABlock(CBZip2InputStream.java:333)
at 
org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.read(CBZip2InputStream.java:399)
at 
org.apache.hadoop.io.compress.BZip2Codec$BZip2CompressionInputStream.read(BZip2Codec.java:483)
at java.io.InputStream.read(InputStream.java:101)
at 
org.apache.hadoop.mapreduce.lib.input.CompressedSplitLineReader.fillBuffer(CompressedSplitLineReader.java:130)
at 
org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:216)
at org.apache.hadoop.util.LineReader.readLine(LineReader.java:174)
at 
org.apache.hadoop.mapreduce.lib.input.CompressedSplitLineReader.readLine(CompressedSplitLineReader.java:159)
at 
org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:209)
at 
org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:47)
at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:248)
at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:216)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
at 
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1553)
at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1125)
at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1125)
at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1850)
at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1850)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Confidentiality Notice:: This email, including attachments, may include 
non-public, proprietary, confidential or legally privileged information. If you 
are not an intended recipient or an authorized agent of an intended recipient, 
you are hereby notified that any dissemination, distribution or copying of the 
information contained in or transmitted with this e-mail is unauthorized and 
strictly prohibited. If you have received this email in error, please notify 
the sender by replying to this message and permanently delete this e-mail, its 
attachments, and any copies of it immediately. You should not retain, copy or 
use this e-mail or any attachment for any purpose, nor disclose all or any part 
of the contents to any other person. Thank you.



Re: Master failover results in running job marked as "WAITING"

2016-02-02 Thread Ted Yu
bq. Failed to connect to master XXX:7077

Is the 'XXX' above the hostname for the new master ?

Thanks

On Tue, Feb 2, 2016 at 1:48 AM, Anthony Tang 
wrote:

> Hi -
>
> I'm running Spark 1.5.2 in standalone mode with multiple masters using
> zookeeper for failover.  The master fails over correctly to the standby
> when it goes down, and running applications seem to continue to run, but in
> the new active master web UI, they are marked as "WAITING", and the workers
> have these entries in their logs:
>
> 16/01/30 00:51:13 ERROR Worker: Connection to master failed! Waiting for
> master to reconnect...
> 16/01/30 00:51:13 WARN Worker: Failed to connect to master XXX:7077
> akka.actor.ActorNotFound: Actor not found for:
> ActorSelection[Anchor(akka.tcp://sparkMaster@XXX:7077/),
> Path(/user/Master)]
>
> Should they be "RUNNING" still? One time, it looked like the job stopped
> functioning (This is a continuously running streaming job), but I haven't
> been able to reproduce it.  FWIW, the driver that started it is still
> marked as "RUNNING".
>
> Thanks.
> - Anthony
>


MLLib embedded dependencies

2016-02-02 Thread Valentin Popov
Hello everyone.

I am having some trouble running word2vec and loading the libs…

Is it possible to use Spark MLlib as an embedded library (like mllib.jar +
spark-core.jar) inside a Tomcat application (it already has the Hadoop libs)? By
default it is one huge jar containing all dependencies, and after inserting it into
the webapps lib folder it generates conflicts with the libraries already present.

Any clue is appreciated.

Regards,
Valentin.
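One hedged way to avoid the all-in-one assembly jar, sketched as an sbt snippet (the version and the excluded artifact are assumptions): depend on the plain spark-core and spark-mllib Maven artifacts and exclude the Hadoop client that Tomcat's classpath already provides.

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"  % "1.6.0" exclude("org.apache.hadoop", "hadoop-client"),
  "org.apache.spark" %% "spark-mllib" % "1.6.0" exclude("org.apache.hadoop", "hadoop-client")
)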







[MLLib] Is the order of the coefficients in a LogisticRegresionModel kept ?

2016-02-02 Thread jmvllt
Hi everyone,

This may sound like a stupid question but I need to be sure of this :

Given a dataframe composed of « n » features: f1, f2, …, fn

For each row of my dataframe, I create a labeled point : 
val row_i = LabeledPoint(label, Vectors.dense(v1_i,v2_i,…, vn_i) ) 
where v1_i, v2_i, …, vn_i are respectively the values of the features f1, f2,
…, fn of the i-th row.

Then I fit a pipeline composed of a StandardScaler and a LogisticRegression
model.
When I get back my LogisticRegressionModel and StandardScalerModel from the
pipeline, I’m calling the getters : 
LogisticRegressionModel.coefficients, StandardScalerModel.mean and
StandardScalerModel.std

This gives me 3 vectors of length « n » 

My question is the following : 
Am I assured that the element at index « j » of each vector corresponds to
the feature « j »? Is the "*order*" of the features kept?
e.g.: Is StandardScalerModel.mean(j) the mean of the feature « j » of my
data frame?

Thanks for your time.
Regards,
J.




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/MLLib-Is-the-order-of-the-coefficients-in-a-LogisticRegresionModel-kept-tp26137.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.



Re: [ANNOUNCE] New SAMBA Package = Spark + AWS Lambda

2016-02-02 Thread Benjamin Kim
Hi David,

My company uses Lambda to do simple data moving and processing using Python
scripts. I can see that using Spark instead for the data processing would make it
into a real production-level platform. Does this pave the way to replacing
the need for a pre-instantiated cluster in AWS or bought hardware in a
datacenter? If so, then this would be a great efficiency gain and make an easier
entry point for Spark usage. I hope the vision is to get rid of all cluster
management when using Spark.

Thanks,
Ben


> On Feb 1, 2016, at 4:23 AM, David Russell  wrote:
> 
> Hi all,
> 
> Just sharing news of the release of a newly available Spark package, SAMBA.
> 
> 
> https://github.com/onetapbeyond/lambda-spark-executor 
> 
> 
> SAMBA is an Apache Spark package offering seamless integration with the AWS 
> Lambda  compute service for Spark batch and 
> streaming applications on the JVM.
> 
> Within traditional Spark deployments RDD tasks are executed using fixed 
> compute resources on worker nodes within the Spark cluster. With SAMBA, 
> application developers can delegate selected RDD tasks to execute using 
> on-demand AWS Lambda compute infrastructure in the cloud.
> 
> Not unlike the recently released ROSE package that extends
> the capabilities of traditional Spark applications with support for CRAN R 
> analytics, SAMBA provides another (hopefully) useful extension for Spark 
> application developers on the JVM.
> 
> SAMBA Spark Package: https://github.com/onetapbeyond/lambda-spark-executor 
> 
> ROSE Spark Package: https://github.com/onetapbeyond/opencpu-spark-executor 
> 
> 
> Questions, suggestions, feedback welcome.
> 
> David
> 
> -- 
> "All that is gold does not glitter, Not all those who wander are lost."



optimal way to load parquet files with partition

2016-02-02 Thread Wei Chen
Hi All,

I have data partitioned by year=yyyy/month=mm/day=dd, what is the best way
to get two months of data from a given year (let's say June and July)?

Two ways I can think of:
1. use unionAll
df1 = sqc.read.parquet('xxx/year=2015/month=6')
df2 = sqc.read.parquet('xxx/year=2015/month=7')
df = df1.unionAll(df2)

2. use filter after load the whole year
df = sqc.read.parquet('xxx/year=2015/').filter('month in (6, 7)')

Which of the above is better? Or are there better ways to handle this?


Thank you,
Wei


Re: optimal way to load parquet files with partition

2016-02-02 Thread Michael Armbrust
It depends on how many partitions you have and whether you are only doing a single
operation.  Loading all the data and filtering will require us to scan the
directories to discover all the months.  This information will be cached.
Then we should prune and avoid reading unneeded data.

Option 1 does not require this scan, but is more work for the developer.
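A hedged middle ground (shown in Scala; the paths are the ones from the example) is to pass both month directories to a single read, which avoids scanning the whole year without a manual unionAll. Note that, as with option 1, the year/month partition columns may not be inferred when reading leaf directories directly unless a common base path is supplied.

val df = sqlContext.read.parquet(
  "xxx/year=2015/month=6",
  "xxx/year=2015/month=7")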

On Tue, Feb 2, 2016 at 10:07 AM, Wei Chen  wrote:

> Hi All,
>
> I have data partitioned by year=yyyy/month=mm/day=dd, what is the best way
> to get two months of data from a given year (let's say June and July)?
>
> Two ways I can think of:
> 1. use unionAll
> df1 = sqc.read.parquet('xxx/year=2015/month=6')
> df2 = sqc.read.parquet('xxx/year=2015/month=7')
> df = df1.unionAll(df2)
>
> 2. use filter after load the whole year
> df = sqc.read.parquet('xxx/year=2015/').filter('month in (6, 7)')
>
> Which of the above is better? Or are there better ways to handle this?
>
>
> Thank you,
> Wei
>


Re: [ANNOUNCE] New SAMBA Package = Spark + AWS Lambda

2016-02-02 Thread David Russell
Hi Ben,

> My company uses Lamba to do simple data moving and processing using python
> scripts. I can see using Spark instead for the data processing would make it
> into a real production level platform.

That may be true. Spark has first class support for Python which
should make your life easier if you do go this route. Once you've
fleshed out your ideas I'm sure folks on this mailing list can provide
helpful guidance based on their real world experience with Spark.

> Does this pave the way into replacing
> the need of a pre-instantiated cluster in AWS or bought hardware in a
> datacenter?

In a word, no. SAMBA is designed to extend, not replace, the traditional
Spark computation and deployment model. At its most basic, the
traditional Spark computation model distributes data and computations
across worker nodes in the cluster.

SAMBA simply allows some of those computations to be performed by AWS
Lambda rather than locally on your worker nodes. There are I believe a
number of potential benefits to using SAMBA in some circumstances:

1. It can help reduce some of the workload on your Spark cluster by
moving that workload onto AWS Lambda, an on-demand compute
infrastructure service.

2. It allows Spark applications written in Java or Scala to make use
of libraries and features offered by Python and JavaScript (Node.js)
today, and potentially, more libraries and features offered by
additional languages in the future as AWS Lambda language support
evolves.

3. It provides a simple, clean API for integration with REST APIs that
may be a benefit to Spark applications that form part of a broader
data pipeline or solution.

> If so, then this would be a great efficiency and make an easier
> entry point for Spark usage. I hope the vision is to get rid of all cluster
> management when using Spark.

You might find one of the hosted Spark platform solutions such as
Databricks or Amazon EMR that handle cluster management for you a good
place to start. At least in my experience, they got me up and running
without difficulty.

David



Re: [MLlib] What is the best way to forecast the next month page visit?

2016-02-02 Thread diplomatic Guru
Hi Jorge,

Unfortunately, I couldn't transform the data as you suggested.

This is what I get:

+---+---------+-------------+
| id|pageIndex|      pageVec|
+---+---------+-------------+
|0.0|      3.0|    (3,[],[])|
|1.0|      0.0|(3,[0],[1.0])|
|2.0|      2.0|(3,[2],[1.0])|
|3.0|      1.0|(3,[1],[1.0])|
+---+---------+-------------+


These are the snippets:

JavaRDD<Row> jrdd = jsc.parallelize(Arrays.asList(
RowFactory.create(0.0, "PageA", 1.0, 2.0, 3.0),
RowFactory.create(1.0, "PageB", 4.0, 5.0, 6.0),
RowFactory.create(2.0, "PageC", 7.0, 8.0, 9.0),
RowFactory.create(3.0, "PageD", 10.0, 11.0, 12.0)

));

StructType schema = new StructType(new StructField[] {
new StructField("id", DataTypes.DoubleType, false,
Metadata.empty()),
new StructField("page", DataTypes.StringType, false,
Metadata.empty()),
new StructField("Nov", DataTypes.DoubleType, false,
Metadata.empty()),
new StructField("Dec", DataTypes.DoubleType, false,
Metadata.empty()),
new StructField("Jan", DataTypes.DoubleType, false,
Metadata.empty()) });

DataFrame df = sqlContext.createDataFrame(jrdd, schema);

StringIndexerModel indexer = new
StringIndexer().setInputCol("page").setInputCol("Nov")

.setInputCol("Dec").setInputCol("Jan").setOutputCol("pageIndex").fit(df);

OneHotEncoder encoder = new
OneHotEncoder().setInputCol("pageIndex").setOutputCol("pageVec");

DataFrame indexed = indexer.transform(df);

DataFrame encoded = encoder.transform(indexed);
encoded.select("id", "pageIndex", "pageVec").show();


Could you please let me know what I'm doing wrong?


PS: My cluster is running Spark 1.3.0, which doesn't support StringIndexer and
OneHotEncoder, but for testing this I've installed 1.6.0 on my local
machine.

Cheers.


On 2 February 2016 at 10:25, Jorge Machado  wrote:

> Hi Guru,
>
> Any results ? :)
>
> On 01/02/2016, at 14:34, diplomatic Guru  wrote:
>
> Hi Jorge,
>
> Thank you for the reply and your example. I'll try your suggestion and
> will let you know the outcome.
>
> Cheers
>
>
> On 1 February 2016 at 13:17, Jorge Machado  wrote:
>
>> Hi Guru,
>>
>> So first transform your page names with OneHotEncoder (
>> https://spark.apache.org/docs/latest/ml-features.html#onehotencoder)
>> and then do the same thing for the months:
>>
>> You will end up with something like
>> (the first three are the page name, the others the month):
>> (0,0,1,0,0,1)
>>
>> Then you have your label, which is what you want to predict. At the end you
>> will have a LabeledPoint like (1 -> (0,0,1,0,0,1)), which will represent
>> (1 -> (PageA, UV_NOV)).
>> After that try a regression tree with
>>
>> val model = DecisionTree.trainRegressor(trainingData,
>> categoricalFeaturesInfo, impurity,maxDepth, maxBins)
>>
>>
>> Regards
>> Jorge
>>
>> On 01/02/2016, at 12:29, diplomatic Guru 
>> wrote:
>>
>> Any suggestions please?
>>
>>
>> On 29 January 2016 at 22:31, diplomatic Guru 
>> wrote:
>>
>>> Hello guys,
>>>
>>> I'm trying to understand how I could predict the next month's page views
>>> based on the previous access pattern.
>>>
>>> For example, I've collected statistics on page views:
>>>
>>> e.g.
>>> Page,UniqueView
>>> -
>>> pageA, 1
>>> pageB, 999
>>> ...
>>> pageZ,200
>>>
>>> I aggregate the statistics monthly.
>>>
>>> I've prepared a file containing last 3 months as this:
>>>
>>> e.g.
>>> Page,UV_NOV, UV_DEC, UV_JAN
>>> ---
>>> pageA, 1,9989,11000
>>> pageB, 999,500,700
>>> ...
>>> pageZ,200,50,34
>>>
>>>
>>> Based on above information, I want to predict the next month (FEB).
>>>
>>> Which algorithm do you think will suit best? I think linear regression
>>> is the safe bet. However, I'm struggling to prepare this data for LR ML,
>>> especially how to prepare the X,Y relationship.
>>>
>>> The Y is easy (unique visitors), but I'm not sure about the X (it should be
>>> Page, right?). Also, how do I plot those three months of data?
>>>
>>> Could you give me an example based on above example data?
>>>
>>>
>>>
>>> Page,UV_NOV, UV_DEC, UV_JAN
>>> ---
>>> 1, 1,9989,11000
>>> 2, 999,500,700
>>> ...
>>> 26,200,50,34
>>>
>>>
>>>
>>>
>>>
>>
>>
>
>
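To make Jorge's suggestion earlier in this thread concrete, here is a hedged Scala sketch that one-hot encodes the page and the month by hand, builds LabeledPoints from the small table above, and fits an MLlib regression tree. The feature layout, tree parameters, and local master are assumptions for illustration, not a recommendation:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.DecisionTree

object PageViewForecast {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("page-view-forecast").setMaster("local[*]"))

    val pages  = Seq("pageA", "pageB", "pageZ")
    val months = Seq("NOV", "DEC", "JAN")

    // (page, month, uniqueViews) rows flattened from the monthly table in the thread.
    val rows = Seq(
      ("pageA", "NOV", 1.0),   ("pageA", "DEC", 9989.0), ("pageA", "JAN", 11000.0),
      ("pageB", "NOV", 999.0), ("pageB", "DEC", 500.0),  ("pageB", "JAN", 700.0),
      ("pageZ", "NOV", 200.0), ("pageZ", "DEC", 50.0),   ("pageZ", "JAN", 34.0))

    // One-hot layout: [pageA, pageB, pageZ, NOV, DEC, JAN]
    def encode(page: String, month: String): Vector = {
      val v = Array.fill(pages.size + months.size)(0.0)
      v(pages.indexOf(page)) = 1.0
      v(pages.size + months.indexOf(month)) = 1.0
      Vectors.dense(v)
    }

    val training = sc.parallelize(rows).map { case (p, m, uv) => LabeledPoint(uv, encode(p, m)) }

    // All features are binary indicators here, so no categoricalFeaturesInfo is declared.
    val model = DecisionTree.trainRegressor(training, Map[Int, Int](), "variance", 5, 32)

    // "Predict" pageA by reusing the month slots; with only three months of history,
    // a time index or lag features would be a better representation for forecasting FEB.
    println(model.predict(encode("pageA", "JAN")))
    sc.stop()
  }
}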


RE: try to read multiple bz2 files in s3

2016-02-02 Thread Lin, Hao
Hi Xiangrui,

For the following problem, I found an issue ticket you posted a while back:
https://issues.apache.org/jira/browse/HADOOP-10614

I wonder if this has been fixed in Spark 1.5.2, which I believe it has. Any
suggestion on how to fix it?

Thanks
Hao


From: Lin, Hao [mailto:hao@finra.org]
Sent: Tuesday, February 02, 2016 10:38 AM
To: Robert Collich; user
Subject: RE: try to read multiple bz2 files in s3

Hi Robert,

I just use textFile. Here is the simple code:

val fs3File=sc.textFile("s3n://my bucket/myfolder/")
fs3File.count

do you suggest I should use sc.parallelize?

many thanks

From: Robert Collich [mailto:rcoll...@gmail.com]
Sent: Monday, February 01, 2016 6:54 PM
To: Lin, Hao; user
Subject: Re: try to read multiple bz2 files in s3

Hi Hao,

Could you please post the corresponding code? Are you using textFile or 
sc.parallelize?

On Mon, Feb 1, 2016 at 2:36 PM Lin, Hao 
mailto:hao@finra.org>> wrote:
When I tried to read multiple bz2 files from S3, I got the following warning
messages. What is the problem here?

16/02/01 22:30:30 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, 
10.162.67.248): java.lang.ArrayIndexOutOfBoundsException: -1844424343
at 
org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.getAndMoveToFrontDecode0(CBZip2InputStream.java:1014)
at 
org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.getAndMoveToFrontDecode(CBZip2InputStream.java:829)
at 
org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.initBlock(CBZip2InputStream.java:504)
at 
org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.changeStateToProcessABlock(CBZip2InputStream.java:333)
at 
org.apache.hadoop.io.compress.bzip2.CBZip2InputStream.read(CBZip2InputStream.java:399)
at 
org.apache.hadoop.io.compress.BZip2Codec$BZip2CompressionInputStream.read(BZip2Codec.java:483)
at java.io.InputStream.read(InputStream.java:101)
at 
org.apache.hadoop.mapreduce.lib.input.CompressedSplitLineReader.fillBuffer(CompressedSplitLineReader.java:130)
at 
org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:216)
at org.apache.hadoop.util.LineReader.readLine(LineReader.java:174)
at 
org.apache.hadoop.mapreduce.lib.input.CompressedSplitLineReader.readLine(CompressedSplitLineReader.java:159)
at 
org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:209)
at 
org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:47)
at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:248)
at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:216)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
at 
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1553)
at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1125)
at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1125)
at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1850)
at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1850)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Confidentiality Notice:: This email, including attachments, may include 
non-public, proprietary, confidential or legally privileged information. If you 
are not an intended recipient or an authorized agent of an intended recipient, 
you are hereby notified that any dissemination, distribution or copying of the 
information contained in or transmitted with this e-mail is unauthorized and 
strictly prohibited. If you have received this email in error, please notify 
the sender by replying to this message and permanently delete this e-mail, its 
attachments, and any copies of it immediately. You should not retain, copy or 
use this e-mail or any attachment for any purpose, nor disclose all or any part 
of the contents to any other person. Thank you.

Re: Spark Pattern and Anti-Pattern

2016-02-02 Thread Lars Albertsson
Querying a service or a database from a Spark job is in most cases an
anti-pattern, but there are exceptions. The jobs become unstable and
nondeterministic when they rely on a live database.

The recommended pattern is to take regular dumps of the database to
your cluster storage, e.g. HDFS, and join the dump dataset with other
datasets, e.g. your incoming events. There are good and bad ways to
dump, however. I covered the topic in this presentation, which you may
find useful: http://www.slideshare.net/lallea/functional-architectural-patterns,
https://vimeo.com/channels/flatmap2015/128468974.
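To illustrate the recommended pattern, here is a minimal Scala sketch, assuming Parquet dumps on HDFS and made-up paths and column names:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object DumpJoin {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("dump-join"))
    val sqlContext = new SQLContext(sc)

    // Periodic dump of the reference table copied to HDFS (e.g. by a nightly export), as Parquet.
    val customers = sqlContext.read.parquet("hdfs:///dumps/customers/date=2016-02-02")

    // Incoming events carrying a foreign key into the dump.
    val events = sqlContext.read.json("hdfs:///incoming/events/2016-02-02")

    // The join replaces per-row database or REST lookups, keeping the job deterministic.
    val enriched = events.join(customers,
      events("customer_id") === customers("customer_id"), "left_outer")
    enriched.write.parquet("hdfs:///output/enriched/2016-02-02")

    sc.stop()
  }
}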

Let me know if you have follow-up questions, or want assistance.

Regards,


Lars Albertsson
Data engineering consultant
www.mapflat.com
+46 70 7687109


On Tue, Jan 26, 2016 at 10:25 PM, Daniel Schulz
 wrote:
> Hi,
>
> We are currently working on a solution architecture to solve IoT workloads
> on Spark. Therefore, I am interested in getting to know whether it is
> considered an Anti-Pattern in Spark to get records from a database and make
> a ReST call to an external server with that data. This external server may
> and will be the bottleneck -- but from a Spark point of view: is it possibly
> harmful to open connections and wait for their responses for vast amounts of
> rows?
>
> In the same manner: is calling an external library (instead of making a ReST
> call) for each row possibly problematic?
>
> How would one embed a C++ library in this workflow: is it best to make a
> function having a JNI call to run it natively -- if we know we are
> single-threaded then? Or is there a better way to include C++ code in Spark jobs?
>
> Many thanks in advance.
>
> Kind regards, Daniel.



Spark 1.5.2 memory error

2016-02-02 Thread Stefan Panayotov
Hi Guys,
 
I need help with Spark memory errors when executing ML pipelines.
The error that I see is:
 



16/02/02 20:34:17 INFO Executor: Executor is trying to kill task 32.0 in stage 32.0 (TID 3298)
16/02/02 20:34:17 INFO Executor: Executor is trying to kill task 12.0 in stage 32.0 (TID 3278)
16/02/02 20:34:39 INFO MemoryStore: ensureFreeSpace(2004728720) called with curMem=296303415, maxMem=8890959790
16/02/02 20:34:39 INFO MemoryStore: Block taskresult_3298 stored as bytes in memory (estimated size 1911.9 MB, free 6.1 GB)
16/02/02 20:34:39 ERROR CoarseGrainedExecutorBackend: RECEIVED SIGNAL 15: SIGTERM
16/02/02 20:34:39 ERROR Executor: Exception in task 12.0 in stage 32.0 (TID 3278)
java.lang.OutOfMemoryError: Java heap space
        at java.util.Arrays.copyOf(Arrays.java:2271)
        at java.io.ByteArrayOutputStream.toByteArray(ByteArrayOutputStream.java:191)
        at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:86)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:256)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
16/02/02 20:34:39 INFO DiskBlockManager: Shutdown hook called
16/02/02 20:34:39 INFO Executor: Finished task 32.0 in stage 32.0 (TID 3298). 2004728720 bytes result sent via BlockManager)
16/02/02 20:34:39 ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker-8,5,main]
java.lang.OutOfMemoryError: Java heap space
        at java.util.Arrays.copyOf(Arrays.java:2271)
        at java.io.ByteArrayOutputStream.toByteArray(ByteArrayOutputStream.java:191)
        at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:86)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:256)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
16/02/02 20:34:39 INFO ShutdownHookManager: Shutdown hook called
16/02/02 20:34:39 INFO MetricsSystemImpl: Stopping azure-file-system metrics system...
16/02/02 20:34:39 INFO MetricsSinkAdapter: azurefs2 thread interrupted.
16/02/02 20:34:39 INFO MetricsSystemImpl: azure-file-system metrics system stopped.
16/02/02 20:34:39 INFO MetricsSystemImpl: azure-file-system metrics system shutdown complete.

And …..

16/02/02 20:09:03 INFO impl.ContainerManagementProtocolProxy: Opening proxy : 10.0.0.5:30050
16/02/02 20:33:51 INFO yarn.YarnAllocator: Completed container container_1454421662639_0011_01_05 (state: COMPLETE, exit status: -104)
16/02/02 20:33:51 WARN yarn.YarnAllocator: Container killed by YARN for exceeding memory limits. 16.8 GB of 16.5 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
16/02/02 20:33:56 INFO yarn.YarnAllocator: Will request 1 executor containers, each with 2 cores and 16768 MB memory including 384 MB overhead
16/02/02 20:33:56 INFO yarn.YarnAllocator: Container request (host: Any, capability: )
16/02/02 20:33:57 INFO yarn.YarnAllocator: Launching container container_1454421662639_0011_01_37 for on host 10.0.0.8
16/02/02 20:33:57 INFO yarn.YarnAllocator: Launching ExecutorRunnable. driverUrl: akka.tcp://sparkDriver@10.0.0.15:47446/user/CoarseGrainedScheduler, executorHostname: 10.0.0.8
16/02/02 20:33:57 INFO yarn.YarnAllocator: Received 1 containers from YARN, launching executors on 1 of them.


I'll really appreciate any help here.
 
Thank you,


Stefan Panayotov, PhD 
Home: 610-355-0919 
Cell: 610-517-5586 
email: spanayo...@msn.com 
spanayo...@outlook.com 
spanayo...@comcast.net
  

Re: Spark 1.5.2 memory error

2016-02-02 Thread Jakob Odersky
Can you share some code that produces the error? It is probably not
due to spark but rather the way data is handled in the user code.
Does your code call any reduceByKey actions? These are often a source
for OOM errors.

On Tue, Feb 2, 2016 at 1:22 PM, Stefan Panayotov  wrote:
> Hi Guys,
>
> I need help with Spark memory errors when executing ML pipelines.
> The error that I see is:
>
>
> 16/02/02 20:34:17 INFO Executor: Executor is trying to kill task 32.0 in
> stage 32.0 (TID 3298)
>
>
> 16/02/02 20:34:17 INFO Executor: Executor is trying to kill task 12.0 in
> stage 32.0 (TID 3278)
>
>
> 16/02/02 20:34:39 INFO MemoryStore: ensureFreeSpace(2004728720) called with
> curMem=296303415, maxMem=8890959790
>
>
> 16/02/02 20:34:39 INFO MemoryStore: Block taskresult_3298 stored as bytes in
> memory (estimated size 1911.9 MB, free 6.1 GB)
>
>
> 16/02/02 20:34:39 ERROR CoarseGrainedExecutorBackend: RECEIVED SIGNAL 15:
> SIGTERM
>
>
> 16/02/02 20:34:39 ERROR Executor: Exception in task 12.0 in stage 32.0 (TID
> 3278)
>
>
> java.lang.OutOfMemoryError: Java heap space
>
>
>at java.util.Arrays.copyOf(Arrays.java:2271)
>
>
>at
> java.io.ByteArrayOutputStream.toByteArray(ByteArrayOutputStream.java:191)
>
>
>at
> org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:86)
>
>
>at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:256)
>
>
>at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>
>
>at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>
>
>at java.lang.Thread.run(Thread.java:745)
>
>
> 16/02/02 20:34:39 INFO DiskBlockManager: Shutdown hook called
>
>
> 16/02/02 20:34:39 INFO Executor: Finished task 32.0 in stage 32.0 (TID
> 3298). 2004728720 bytes result sent via BlockManager)
>
>
> 16/02/02 20:34:39 ERROR SparkUncaughtExceptionHandler: Uncaught exception in
> thread Thread[Executor task launch worker-8,5,main]
>
>
> java.lang.OutOfMemoryError: Java heap space
>
>
>at java.util.Arrays.copyOf(Arrays.java:2271)
>
>
>at
> java.io.ByteArrayOutputStream.toByteArray(ByteArrayOutputStream.java:191)
>
>
>at
> org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:86)
>
>
>at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:256)
>
>
>at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>
>
>at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>
>
>at java.lang.Thread.run(Thread.java:745)
>
>
> 16/02/02 20:34:39 INFO ShutdownHookManager: Shutdown hook called
>
>
> 16/02/02 20:34:39 INFO MetricsSystemImpl: Stopping azure-file-system metrics
> system...
>
>
> 16/02/02 20:34:39 INFO MetricsSinkAdapter: azurefs2 thread interrupted.
>
>
> 16/02/02 20:34:39 INFO MetricsSystemImpl: azure-file-system metrics system
> stopped.
>
>
> 16/02/02 20:34:39 INFO MetricsSystemImpl: azure-file-system metrics system
> shutdown complete.
>
>
>
>
>
> And …..
>
>
>
>
>
> 16/02/02 20:09:03 INFO impl.ContainerManagementProtocolProxy: Opening proxy
> : 10.0.0.5:30050
>
>
> 16/02/02 20:33:51 INFO yarn.YarnAllocator: Completed container
> container_1454421662639_0011_01_05 (state: COMPLETE, exit status: -104)
>
>
> 16/02/02 20:33:51 WARN yarn.YarnAllocator: Container killed by YARN for
> exceeding memory limits. 16.8 GB of 16.5 GB physical memory used. Consider
> boosting spark.yarn.executor.memoryOverhead.
>
>
> 16/02/02 20:33:56 INFO yarn.YarnAllocator: Will request 1 executor
> containers, each with 2 cores and 16768 MB memory including 384 MB overhead
>
>
> 16/02/02 20:33:56 INFO yarn.YarnAllocator: Container request (host: Any,
> capability: )
>
>
> 16/02/02 20:33:57 INFO yarn.YarnAllocator: Launching container
> container_1454421662639_0011_01_37 for on host 10.0.0.8
>
>
> 16/02/02 20:33:57 INFO yarn.YarnAllocator: Launching ExecutorRunnable.
> driverUrl:
> akka.tcp://sparkDriver@10.0.0.15:47446/user/CoarseGrainedScheduler,
> executorHostname: 10.0.0.8
>
>
> 16/02/02 20:33:57 INFO yarn.YarnAllocator: Received 1 containers from YARN,
> launching executors on 1 of them.
>
>
> I'll really appreciate any help here.
>
> Thank you,
>
> Stefan Panayotov, PhD
> Home: 610-355-0919
> Cell: 610-517-5586
> email: spanayo...@msn.com
> spanayo...@outlook.com
> spanayo...@comcast.net
>



Error trying to get DF for Hive table stored HBase

2016-02-02 Thread Doug Balog
I’m trying to create a DF for an external Hive table that is in HBase. 
I get a NoSuchMethodError
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.initSerdeParams(Lorg/apache/hadoop/conf/Configuration;Ljava/util/Properties;Ljava/lang/String;)Lorg/apache/hadoop/hive/serde2/lazy/LazySimpleSerDe$SerDeParameters;

I’m running Spark 1.6.0 on HDP 2.2.4-12-1 (Hive 0.14 and HBase 0.98.4) in 
secure mode. 

Anybody see this before ?

Below is a stack trace and the hive table’s info.

scala> sqlContext.table("item_data_lib.pcn_item")
java.lang.NoSuchMethodError: 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.initSerdeParams(Lorg/apache/hadoop/conf/Configuration;Ljava/util/Properties;Ljava/lang/String;)Lorg/apache/hadoop/hive/serde2/lazy/LazySimpleSerDe$SerDeParameters;
at 
org.apache.hadoop.hive.hbase.HBaseSerDeParameters.<init>(HBaseSerDeParameters.java:93)
at 
org.apache.hadoop.hive.hbase.HBaseSerDe.initialize(HBaseSerDe.java:92)
at 
org.apache.hadoop.hive.serde2.AbstractSerDe.initialize(AbstractSerDe.java:53)
at 
org.apache.hadoop.hive.serde2.SerDeUtils.initializeSerDe(SerDeUtils.java:521)
at 
org.apache.hadoop.hive.metastore.MetaStoreUtils.getDeserializer(MetaStoreUtils.java:391)
at 
org.apache.hadoop.hive.ql.metadata.Table.getDeserializerFromMetaStore(Table.java:276)
at 
org.apache.hadoop.hive.ql.metadata.Table.getDeserializer(Table.java:258)
at org.apache.hadoop.hive.ql.metadata.Table.getCols(Table.java:605)
at 
org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$getTableOption$1$$anonfun$3.apply(ClientWrapper.scala:331)
at 
org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$getTableOption$1$$anonfun$3.apply(ClientWrapper.scala:326)
at scala.Option.map(Option.scala:145)
at 
org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$getTableOption$1.apply(ClientWrapper.scala:326)
at 
org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$getTableOption$1.apply(ClientWrapper.scala:321)
at 
org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$withHiveState$1.apply(ClientWrapper.scala:279)
at 
org.apache.spark.sql.hive.client.ClientWrapper.liftedTree1$1(ClientWrapper.scala:226)
at 
org.apache.spark.sql.hive.client.ClientWrapper.retryLocked(ClientWrapper.scala:225)
at 
org.apache.spark.sql.hive.client.ClientWrapper.withHiveState(ClientWrapper.scala:268)
at 
org.apache.spark.sql.hive.client.ClientWrapper.getTableOption(ClientWrapper.scala:321)
at 
org.apache.spark.sql.hive.client.ClientInterface$class.getTable(ClientInterface.scala:122)
at 
org.apache.spark.sql.hive.client.ClientWrapper.getTable(ClientWrapper.scala:60)
at 
org.apache.spark.sql.hive.HiveMetastoreCatalog.lookupRelation(HiveMetastoreCatalog.scala:384)
at 
org.apache.spark.sql.hive.HiveContext$$anon$2.org$apache$spark$sql$catalyst$analysis$OverrideCatalog$$super$lookupRelation(HiveContext.scala:457)
at 
org.apache.spark.sql.catalyst.analysis.OverrideCatalog$class.lookupRelation(Catalog.scala:161)
at 
org.apache.spark.sql.hive.HiveContext$$anon$2.lookupRelation(HiveContext.scala:457)
at org.apache.spark.sql.SQLContext.table(SQLContext.scala:831)
at org.apache.spark.sql.SQLContext.table(SQLContext.scala:827)


hive> show create table item_data_lib.pcn_item;
OK
CREATE EXTERNAL TABLE `item_data_lib.pcn_item`(
  `key` string COMMENT 'from deserializer',
  `p1` string COMMENT 'from deserializer',
  `p2` string COMMENT 'from deserializer',
  `p3` string COMMENT 'from deserializer',
  `p4` string COMMENT 'from deserializer',
  `p5` string COMMENT 'from deserializer',
  `p6` string COMMENT 'from deserializer',
  `p7` string COMMENT 'from deserializer',
  `p8` string COMMENT 'from deserializer',
  `p9` string COMMENT 'from deserializer',
  `p10` string COMMENT 'from deserializer',
  `p11` string COMMENT 'from deserializer',
  `p12` string COMMENT 'from deserializer',
  `p13` string COMMENT 'from deserializer',
  `d1` string COMMENT 'from deserializer',
  `d2` string COMMENT 'from deserializer',
  `d3` string COMMENT 'from deserializer',
  `d4` string COMMENT 'from deserializer',
  `d5` string COMMENT 'from deserializer',
  `d6` string COMMENT 'from deserializer',
  `d7` string COMMENT 'from deserializer',
  `d8` string COMMENT 'from deserializer',
  `d9` string COMMENT 'from deserializer',
  `d10` string COMMENT 'from deserializer',
  `d11` string COMMENT 'from deserializer',
  `d12` string COMMENT 'from deserializer',
  `d13` string COMMENT 'from deserializer',
  `d14` string COMMENT 'from deserializer',
  `d15` string COMMENT 'from deserializer',
  `d16` string COMMENT 'from deserializer',
  `d17` string COMMENT 'from deserializer')
ROW FORMAT SERDE
  'org.apache.hadoop.hive.hbase.HBaseSerDe'
STORED BY
  'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  
'hbase.columns.mapping'=':key,p:p1,p:p2,p:p3,p:p4,p:p5,p:p6,p:

Re: Spark 1.5.2 memory error

2016-02-02 Thread Ted Yu
What value do you use for spark.yarn.executor.memoryOverhead ?

Please see https://spark.apache.org/docs/latest/running-on-yarn.html for
description of the parameter.

Which Spark release are you using ?

Cheers
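For reference, a minimal sketch of bumping the overhead programmatically, assuming Scala and an arbitrary 2048 MB value (it can equally be passed with --conf to spark-submit):

val conf = new org.apache.spark.SparkConf()
  .setAppName("ml-pipeline")
  // Value is in MB; the default is roughly 10% of executor memory, with a 384 MB floor.
  .set("spark.yarn.executor.memoryOverhead", "2048")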

On Tue, Feb 2, 2016 at 1:38 PM, Jakob Odersky  wrote:

> Can you share some code that produces the error? It is probably not
> due to spark but rather the way data is handled in the user code.
> Does your code call any reduceByKey actions? These are often a source
> for OOM errors.
>
> On Tue, Feb 2, 2016 at 1:22 PM, Stefan Panayotov 
> wrote:
> > Hi Guys,
> >
> > I need help with Spark memory errors when executing ML pipelines.
> > The error that I see is:
> >
> >
> > 16/02/02 20:34:17 INFO Executor: Executor is trying to kill task 32.0 in
> > stage 32.0 (TID 3298)
> >
> >
> > 16/02/02 20:34:17 INFO Executor: Executor is trying to kill task 12.0 in
> > stage 32.0 (TID 3278)
> >
> >
> > 16/02/02 20:34:39 INFO MemoryStore: ensureFreeSpace(2004728720) called
> with
> > curMem=296303415, maxMem=8890959790
> >
> >
> > 16/02/02 20:34:39 INFO MemoryStore: Block taskresult_3298 stored as
> bytes in
> > memory (estimated size 1911.9 MB, free 6.1 GB)
> >
> >
> > 16/02/02 20:34:39 ERROR CoarseGrainedExecutorBackend: RECEIVED SIGNAL 15:
> > SIGTERM
> >
> >
> > 16/02/02 20:34:39 ERROR Executor: Exception in task 12.0 in stage 32.0
> (TID
> > 3278)
> >
> >
> > java.lang.OutOfMemoryError: Java heap space
> >
> >
> >at java.util.Arrays.copyOf(Arrays.java:2271)
> >
> >
> >at
> > java.io.ByteArrayOutputStream.toByteArray(ByteArrayOutputStream.java:191)
> >
> >
> >at
> >
> org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:86)
> >
> >
> >at
> > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:256)
> >
> >
> >at
> >
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> >
> >
> >at
> >
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> >
> >
> >at java.lang.Thread.run(Thread.java:745)
> >
> >
> > 16/02/02 20:34:39 INFO DiskBlockManager: Shutdown hook called
> >
> >
> > 16/02/02 20:34:39 INFO Executor: Finished task 32.0 in stage 32.0 (TID
> > 3298). 2004728720 bytes result sent via BlockManager)
> >
> >
> > 16/02/02 20:34:39 ERROR SparkUncaughtExceptionHandler: Uncaught
> exception in
> > thread Thread[Executor task launch worker-8,5,main]
> >
> >
> > java.lang.OutOfMemoryError: Java heap space
> >
> >
> >at java.util.Arrays.copyOf(Arrays.java:2271)
> >
> >
> >at
> > java.io.ByteArrayOutputStream.toByteArray(ByteArrayOutputStream.java:191)
> >
> >
> >at
> >
> org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:86)
> >
> >
> >at
> > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:256)
> >
> >
> >at
> >
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> >
> >
> >at
> >
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> >
> >
> >at java.lang.Thread.run(Thread.java:745)
> >
> >
> > 16/02/02 20:34:39 INFO ShutdownHookManager: Shutdown hook called
> >
> >
> > 16/02/02 20:34:39 INFO MetricsSystemImpl: Stopping azure-file-system
> metrics
> > system...
> >
> >
> > 16/02/02 20:34:39 INFO MetricsSinkAdapter: azurefs2 thread interrupted.
> >
> >
> > 16/02/02 20:34:39 INFO MetricsSystemImpl: azure-file-system metrics
> system
> > stopped.
> >
> >
> > 16/02/02 20:34:39 INFO MetricsSystemImpl: azure-file-system metrics
> system
> > shutdown complete.
> >
> >
> >
> >
> >
> > And …..
> >
> >
> >
> >
> >
> > 16/02/02 20:09:03 INFO impl.ContainerManagementProtocolProxy: Opening
> proxy
> > : 10.0.0.5:30050
> >
> >
> > 16/02/02 20:33:51 INFO yarn.YarnAllocator: Completed container
> > container_1454421662639_0011_01_05 (state: COMPLETE, exit status:
> -104)
> >
> >
> > 16/02/02 20:33:51 WARN yarn.YarnAllocator: Container killed by YARN for
> > exceeding memory limits. 16.8 GB of 16.5 GB physical memory used.
> Consider
> > boosting spark.yarn.executor.memoryOverhead.
> >
> >
> > 16/02/02 20:33:56 INFO yarn.YarnAllocator: Will request 1 executor
> > containers, each with 2 cores and 16768 MB memory including 384 MB
> overhead
> >
> >
> > 16/02/02 20:33:56 INFO yarn.YarnAllocator: Container request (host: Any,
> > capability: )
> >
> >
> > 16/02/02 20:33:57 INFO yarn.YarnAllocator: Launching container
> > container_1454421662639_0011_01_37 for on host 10.0.0.8
> >
> >
> > 16/02/02 20:33:57 INFO yarn.YarnAllocator: Launching ExecutorRunnable.
> > driverUrl:
> > akka.tcp://sparkDriver@10.0.0.15:47446/user/CoarseGrainedScheduler,
> > executorHostname: 10.0.0.8
> >
> >
> > 16/02/02 20:33:57 INFO yarn.YarnAllocator: Received 1 containers from
> YARN,
> > launching executors on 1 of them.
> >
> >
> > I'll really appreciate any help here.
> >
> > Thank you,
> >
> > Stefan Panayotov, PhD
> > Home: 610-355-091

RE: Spark 1.5.2 memory error

2016-02-02 Thread Stefan Panayotov
For the memoryOverhead I have the default of 10% of 16g, and the Spark version is
1.5.2.

Stefan Panayotov, PhD
Sent from Outlook Mail for Windows 10 phone


From: Ted Yu
Sent: Tuesday, February 2, 2016 4:52 PM
To: Jakob Odersky
Cc: Stefan Panayotov; user@spark.apache.org
Subject: Re: Spark 1.5.2 memory error

What value do you use for spark.yarn.executor.memoryOverhead ?

Please see https://spark.apache.org/docs/latest/running-on-yarn.html for 
description of the parameter.

Which Spark release are you using ?

Cheers

On Tue, Feb 2, 2016 at 1:38 PM, Jakob Odersky  wrote:
Can you share some code that produces the error? It is probably not
due to spark but rather the way data is handled in the user code.
Does your code call any reduceByKey actions? These are often a source
for OOM errors.

On Tue, Feb 2, 2016 at 1:22 PM, Stefan Panayotov  wrote:
> Hi Guys,
>
> I need help with Spark memory errors when executing ML pipelines.
> The error that I see is:
>
>
> 16/02/02 20:34:17 INFO Executor: Executor is trying to kill task 32.0 in
> stage 32.0 (TID 3298)
>
>
> 16/02/02 20:34:17 INFO Executor: Executor is trying to kill task 12.0 in
> stage 32.0 (TID 3278)
>
>
> 16/02/02 20:34:39 INFO MemoryStore: ensureFreeSpace(2004728720) called with
> curMem=296303415, maxMem=8890959790
>
>
> 16/02/02 20:34:39 INFO MemoryStore: Block taskresult_3298 stored as bytes in
> memory (estimated size 1911.9 MB, free 6.1 GB)
>
>
> 16/02/02 20:34:39 ERROR CoarseGrainedExecutorBackend: RECEIVED SIGNAL 15:
> SIGTERM
>
>
> 16/02/02 20:34:39 ERROR Executor: Exception in task 12.0 in stage 32.0 (TID
> 3278)
>
>
> java.lang.OutOfMemoryError: Java heap space
>
>
>        at java.util.Arrays.copyOf(Arrays.java:2271)
>
>
>        at
> java.io.ByteArrayOutputStream.toByteArray(ByteArrayOutputStream.java:191)
>
>
>        at
> org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:86)
>
>
>        at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:256)
>
>
>        at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>
>
>        at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>
>
>        at java.lang.Thread.run(Thread.java:745)
>
>
> 16/02/02 20:34:39 INFO DiskBlockManager: Shutdown hook called
>
>
> 16/02/02 20:34:39 INFO Executor: Finished task 32.0 in stage 32.0 (TID
> 3298). 2004728720 bytes result sent via BlockManager)
>
>
> 16/02/02 20:34:39 ERROR SparkUncaughtExceptionHandler: Uncaught exception in
> thread Thread[Executor task launch worker-8,5,main]
>
>
> java.lang.OutOfMemoryError: Java heap space
>
>
>        at java.util.Arrays.copyOf(Arrays.java:2271)
>
>
>        at
> java.io.ByteArrayOutputStream.toByteArray(ByteArrayOutputStream.java:191)
>
>
>        at
> org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:86)
>
>
>        at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:256)
>
>
>        at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>
>
>        at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>
>
>        at java.lang.Thread.run(Thread.java:745)
>
>
> 16/02/02 20:34:39 INFO ShutdownHookManager: Shutdown hook called
>
>
> 16/02/02 20:34:39 INFO MetricsSystemImpl: Stopping azure-file-system metrics
> system...
>
>
> 16/02/02 20:34:39 INFO MetricsSinkAdapter: azurefs2 thread interrupted.
>
>
> 16/02/02 20:34:39 INFO MetricsSystemImpl: azure-file-system metrics system
> stopped.
>
>
> 16/02/02 20:34:39 INFO MetricsSystemImpl: azure-file-system metrics system
> shutdown complete.
>
>
>
>
>
> And …..
>
>
>
>
>
> 16/02/02 20:09:03 INFO impl.ContainerManagementProtocolProxy: Opening proxy
> : 10.0.0.5:30050
>
>
> 16/02/02 20:33:51 INFO yarn.YarnAllocator: Completed container
> container_1454421662639_0011_01_05 (state: COMPLETE, exit status: -104)
>
>
> 16/02/02 20:33:51 WARN yarn.YarnAllocator: Container killed by YARN for
> exceeding memory limits. 16.8 GB of 16.5 GB physical memory used. Consider
> boosting spark.yarn.executor.memoryOverhead.
>
>
> 16/02/02 20:33:56 INFO yarn.YarnAllocator: Will request 1 executor
> containers, each with 2 cores and 16768 MB memory including 384 MB overhead
>
>
> 16/02/02 20:33:56 INFO yarn.YarnAllocator: Container request (host: Any,
> capability: )
>
>
> 16/02/02 20:33:57 INFO yarn.YarnAllocator: Launching container
> container_1454421662639_0011_01_37 for on host 10.0.0.8
>
>
> 16/02/02 20:33:57 INFO yarn.YarnAllocator: Launching ExecutorRunnable.
> driverUrl:
> akka.tcp://sparkDriver@10.0.0.15:47446/user/CoarseGrainedScheduler,
> executorHostname: 10.0.0.8
>
>
> 16/02/02 20:33:57 INFO yarn.YarnAllocator: Received 1 containers from YARN,
> launching executors on 1 of them.
>
>
> I'll really appreciate any help here.
>
> Thank you,
>
> Stefan Panayotov, PhD
> Home: 610-355-0919
> Cell: 610-517-5586
> email: spanayo...@msn.com
> spanayo...@outlook.co

[Spark 1.5+] ReceiverTracker seems not to stop Kinesis receivers

2016-02-02 Thread Roberto Coluccio
Hi,

I've been struggling with an issue ever since I tried to upgrade my Spark
Streaming solution from 1.4.1 to 1.5+.

I have a Spark Streaming app which creates 3 ReceiverInputDStreams
leveraging KinesisUtils.createStream API.

I used to leverage a timeout to terminate my app
(StreamingContext.awaitTerminationOrTimeout(timeout)) gracefully (SparkConf
spark.streaming.stopGracefullyOnShutdown=true).

I used to submit my Spark app on EMR in yarn-cluster mode.

Everything worked fine up to Spark 1.4.1 (on EMR AMI 3.9).

Since I upgraded (tried with Spark 1.5.2 on emr-4.2.0 and Spark 1.6.0 on
emr-4.3.0) I can't get the app to actually terminate. The logs tell me it
tries to, but no confirmation that the receivers stopped ever arrives. Instead,
when the timer reaches the next period, the StreamingContext continues its
processing for a while (then it gets killed with a SIGTERM 15; YARN's vmem
and pmem kills are disabled).

...

16/02/02 21:22:08 INFO ApplicationMaster: Final app status: SUCCEEDED,
exitCode: 0
16/02/02 21:22:08 INFO StreamingContext: Invoking
stop(stopGracefully=true) from shutdown hook
16/02/02 21:22:08 INFO ReceiverTracker: Sent stop signal to all 3 receivers
16/02/02 21:22:18 INFO ReceiverTracker: Waiting for receiver job to
terminate gracefully
16/02/02 21:22:52 INFO ContextCleaner: Cleaned shuffle 141
16/02/02 21:22:52 INFO BlockManagerInfo: Removed broadcast_217_piece0
on 172.31.3.140:50152 in memory (size: 23.9 KB, free: 2.1 GB)
16/02/02 21:22:52 INFO BlockManagerInfo: Removed broadcast_217_piece0
on ip-172-31-3-141.ec2.internal:41776 in memory (size: 23.9 KB, free:
1224.9 MB)
16/02/02 21:22:52 INFO BlockManagerInfo: Removed broadcast_217_piece0
on ip-172-31-3-140.ec2.internal:36295 in memory (size: 23.9 KB, free:
1224.0 MB)
16/02/02 21:22:52 INFO BlockManagerInfo: Removed broadcast_217_piece0
on ip-172-31-3-141.ec2.internal:56428 in memory (size: 23.9 KB, free:
1224.9 MB)
16/02/02 21:22:52 INFO BlockManagerInfo: Removed broadcast_217_piece0
on ip-172-31-3-140.ec2.internal:50542 in memory (size: 23.9 KB, free:
1224.7 MB)
16/02/02 21:22:52 INFO ContextCleaner: Cleaned accumulator 184
16/02/02 21:22:52 INFO BlockManagerInfo: Removed broadcast_218_piece0
on 172.31.3.140:50152 in memory (size: 3.0 KB, free: 2.1 GB)
16/02/02 21:22:52 INFO BlockManagerInfo: Removed broadcast_218_piece0
on ip-172-31-3-141.ec2.internal:41776 in memory (size: 3.0 KB, free:
1224.9 MB)
16/02/02 21:22:52 INFO BlockManagerInfo: Removed broadcast_218_piece0
on ip-172-31-3-141.ec2.internal:56428 in memory (size: 3.0 KB, free:
1224.9 MB)
16/02/02 21:22:52 INFO BlockManagerInfo: Removed broadcast_218_piece0
on ip-172-31-3-140.ec2.internal:36295 in memory (size: 3.0 KB, free:
1224.0 MB)
16/02/02 21:22:52 INFO BlockManagerInfo: Removed broadcast_218_piece0
on ip-172-31-3-140.ec2.internal:50542 in memory (size: 3.0 KB, free:
1224.7 MB)
16/02/02 21:25:00 INFO StateDStream: Marking RDD 680 for time
145444830 ms for checkpointing
16/02/02 21:25:00 INFO StateDStream: Marking RDD 708 for time
145444830 ms for checkpointing
16/02/02 21:25:00 INFO TransformedDStream: Slicing from 145444800
ms to 145444830 ms (aligned to 145444800 ms and 145444830
ms)
16/02/02 21:25:00 INFO StateDStream: Marking RDD 777 for time
145444830 ms for checkpointing
16/02/02 21:25:00 INFO StateDStream: Marking RDD 801 for time
145444830 ms for checkpointing
16/02/02 21:25:00 INFO JobScheduler: Added jobs for time 145444830 ms
16/02/02 21:25:00 INFO JobGenerator: Checkpointing graph for time
145444830 ms
16/02/02 21:25:00 INFO DStreamGraph: Updating checkpoint data for time
145444830 ms
16/02/02 21:25:00 INFO JobScheduler: Starting job streaming job
145444830 ms.0 from job set of time 145444830 ms

...


Please help: this is really blocking the upgrade process to the latest Spark
versions, and I really don't know how to work around it.

Any help would be very much appreciated.

Thank you,

Roberto
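
A minimal sketch of the timeout-plus-graceful-stop pattern described above (the
app name, batch interval and timeout value are placeholders, not taken from the
original job):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("kinesis-app")                                 // assumed name
  .set("spark.streaming.stopGracefullyOnShutdown", "true")   // graceful stop on shutdown hook

val ssc = new StreamingContext(conf, Seconds(10))
// ... create the KinesisUtils.createStream(...) DStreams and wire up processing here ...

ssc.start()
val timeoutMs = 60L * 60L * 1000L            // run for at most one hour (placeholder)
ssc.awaitTerminationOrTimeout(timeoutMs)
// Explicit graceful stop: the ReceiverTracker should stop the receivers and
// let in-flight batches drain before the context shuts down.
ssc.stop(stopSparkContext = true, stopGracefully = true)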


Re: Error trying to get DF for Hive table stored HBase

2016-02-02 Thread Ted Yu
Looks like this is related:
HIVE-12406

FYI

On Tue, Feb 2, 2016 at 1:40 PM, Doug Balog  wrote:

> I’m trying to create a DF for an external Hive table that is in HBase.
> I get the a NoSuchMethodError
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.initSerdeParams(Lorg/apache/hadoop/conf/Configuration;Ljava/util/Properties;Ljava/lang/String;)Lorg/apache/hadoop/hive/serde2/lazy/LazySimpleSerDe$SerDeParameters;
>
> I’m running Spark 1.6.0 on HDP 2.2.4-12-1 (Hive 0.14 and HBase 0.98.4) in
> secure mode.
>
> Anybody see this before ?
>
> Below is a stack trace and the hive table’s info.
>
> scala> sqlContext.table("item_data_lib.pcn_item")
> java.lang.NoSuchMethodError:
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.initSerdeParams(Lorg/apache/hadoop/conf/Configuration;Ljava/util/Properties;Ljava/lang/String;)Lorg/apache/hadoop/hive/serde2/lazy/LazySimpleSerDe$SerDeParameters;
> at
> org.apache.hadoop.hive.hbase.HBaseSerDeParameters.(HBaseSerDeParameters.java:93)
> at
> org.apache.hadoop.hive.hbase.HBaseSerDe.initialize(HBaseSerDe.java:92)
> at
> org.apache.hadoop.hive.serde2.AbstractSerDe.initialize(AbstractSerDe.java:53)
> at
> org.apache.hadoop.hive.serde2.SerDeUtils.initializeSerDe(SerDeUtils.java:521)
> at
> org.apache.hadoop.hive.metastore.MetaStoreUtils.getDeserializer(MetaStoreUtils.java:391)
> at
> org.apache.hadoop.hive.ql.metadata.Table.getDeserializerFromMetaStore(Table.java:276)
> at
> org.apache.hadoop.hive.ql.metadata.Table.getDeserializer(Table.java:258)
> at org.apache.hadoop.hive.ql.metadata.Table.getCols(Table.java:605)
> at
> org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$getTableOption$1$$anonfun$3.apply(ClientWrapper.scala:331)
> at
> org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$getTableOption$1$$anonfun$3.apply(ClientWrapper.scala:326)
> at scala.Option.map(Option.scala:145)
> at
> org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$getTableOption$1.apply(ClientWrapper.scala:326)
> at
> org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$getTableOption$1.apply(ClientWrapper.scala:321)
> at
> org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$withHiveState$1.apply(ClientWrapper.scala:279)
> at
> org.apache.spark.sql.hive.client.ClientWrapper.liftedTree1$1(ClientWrapper.scala:226)
> at
> org.apache.spark.sql.hive.client.ClientWrapper.retryLocked(ClientWrapper.scala:225)
> at
> org.apache.spark.sql.hive.client.ClientWrapper.withHiveState(ClientWrapper.scala:268)
> at
> org.apache.spark.sql.hive.client.ClientWrapper.getTableOption(ClientWrapper.scala:321)
> at
> org.apache.spark.sql.hive.client.ClientInterface$class.getTable(ClientInterface.scala:122)
> at
> org.apache.spark.sql.hive.client.ClientWrapper.getTable(ClientWrapper.scala:60)
> at
> org.apache.spark.sql.hive.HiveMetastoreCatalog.lookupRelation(HiveMetastoreCatalog.scala:384)
> at org.apache.spark.sql.hive.HiveContext$$anon$2.org
> $apache$spark$sql$catalyst$analysis$OverrideCatalog$$super$lookupRelation(HiveContext.scala:457)
> at
> org.apache.spark.sql.catalyst.analysis.OverrideCatalog$class.lookupRelation(Catalog.scala:161)
> at
> org.apache.spark.sql.hive.HiveContext$$anon$2.lookupRelation(HiveContext.scala:457)
> at org.apache.spark.sql.SQLContext.table(SQLContext.scala:831)
> at org.apache.spark.sql.SQLContext.table(SQLContext.scala:827)
>
>
> hive> show create table item_data_lib.pcn_item;
> OK
> CREATE EXTERNAL TABLE `item_data_lib.pcn_item`(
>   `key` string COMMENT 'from deserializer',
>   `p1` string COMMENT 'from deserializer',
>   `p2` string COMMENT 'from deserializer',
>   `p3` string COMMENT 'from deserializer',
>   `p4` string COMMENT 'from deserializer',
>   `p5` string COMMENT 'from deserializer',
>   `p6` string COMMENT 'from deserializer',
>   `p7` string COMMENT 'from deserializer',
>   `p8` string COMMENT 'from deserializer',
>   `p9` string COMMENT 'from deserializer',
>   `p10` string COMMENT 'from deserializer',
>   `p11` string COMMENT 'from deserializer',
>   `p12` string COMMENT 'from deserializer',
>   `p13` string COMMENT 'from deserializer',
>   `d1` string COMMENT 'from deserializer',
>   `d2` string COMMENT 'from deserializer',
>   `d3` string COMMENT 'from deserializer',
>   `d4` string COMMENT 'from deserializer',
>   `d5` string COMMENT 'from deserializer',
>   `d6` string COMMENT 'from deserializer',
>   `d7` string COMMENT 'from deserializer',
>   `d8` string COMMENT 'from deserializer',
>   `d9` string COMMENT 'from deserializer',
>   `d10` string COMMENT 'from deserializer',
>   `d11` string COMMENT 'from deserializer',
>   `d12` string COMMENT 'from deserializer',
>   `d13` string COMMENT 'from deserializer',
>   `d14` string COMMENT 'from deserializer',
>   `d15` string COMMENT 'from deserializer',
>   `d16` string COMMENT 'from deserializer

Re: Spark 1.5.2 memory error

2016-02-02 Thread Jim Green
Look at part#3 in below blog:
http://www.openkb.info/2015/06/resource-allocation-configurations-for.html

You may want to increase the executor memory, not just the
spark.yarn.executor.memoryOverhead.
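
A sketch of the two knobs being discussed, expressed as SparkConf settings (the
values are placeholders, not a tuned recommendation for this job):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.executor.memory", "20g")                   // executor JVM heap
  .set("spark.yarn.executor.memoryOverhead", "2048")     // extra MB YARN reserves per executor
// YARN kills the container when heap + overhead exceeds its limit, so both
// numbers matter when sizing against the "16.8 GB of 16.5 GB" message above.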

On Tue, Feb 2, 2016 at 2:14 PM, Stefan Panayotov  wrote:

> For the memoryOvethead I have the default of 10% of 16g, and Spark version
> is 1.5.2.
>
>
>
> Stefan Panayotov, PhD
> Sent from Outlook Mail for Windows 10 phone
>
>
>
>
> *From: *Ted Yu 
> *Sent: *Tuesday, February 2, 2016 4:52 PM
> *To: *Jakob Odersky 
> *Cc: *Stefan Panayotov ; user@spark.apache.org
> *Subject: *Re: Spark 1.5.2 memory error
>
>
>
> What value do you use for spark.yarn.executor.memoryOverhead ?
>
>
>
> Please see https://spark.apache.org/docs/latest/running-on-yarn.html for
> description of the parameter.
>
>
>
> Which Spark release are you using ?
>
>
>
> Cheers
>
>
>
> On Tue, Feb 2, 2016 at 1:38 PM, Jakob Odersky  wrote:
>
> Can you share some code that produces the error? It is probably not
> due to spark but rather the way data is handled in the user code.
> Does your code call any reduceByKey actions? These are often a source
> for OOM errors.
>
>
> On Tue, Feb 2, 2016 at 1:22 PM, Stefan Panayotov 
> wrote:
> > Hi Guys,
> >
> > I need help with Spark memory errors when executing ML pipelines.
> > The error that I see is:
> >
> >
> > 16/02/02 20:34:17 INFO Executor: Executor is trying to kill task 32.0 in
> > stage 32.0 (TID 3298)
> >
> >
> > 16/02/02 20:34:17 INFO Executor: Executor is trying to kill task 12.0 in
> > stage 32.0 (TID 3278)
> >
> >
> > 16/02/02 20:34:39 INFO MemoryStore: ensureFreeSpace(2004728720) called
> with
> > curMem=296303415, maxMem=8890959790
> >
> >
> > 16/02/02 20:34:39 INFO MemoryStore: Block taskresult_3298 stored as
> bytes in
> > memory (estimated size 1911.9 MB, free 6.1 GB)
> >
> >
> > 16/02/02 20:34:39 ERROR CoarseGrainedExecutorBackend: RECEIVED SIGNAL 15:
> > SIGTERM
> >
> >
> > 16/02/02 20:34:39 ERROR Executor: Exception in task 12.0 in stage 32.0
> (TID
> > 3278)
> >
> >
> > java.lang.OutOfMemoryError: Java heap space
> >
> >
> >at java.util.Arrays.copyOf(Arrays.java:2271)
> >
> >
> >at
> > java.io.ByteArrayOutputStream.toByteArray(ByteArrayOutputStream.java:191)
> >
> >
> >at
> >
> org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:86)
> >
> >
> >at
> > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:256)
> >
> >
> >at
> >
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> >
> >
> >at
> >
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> >
> >
> >at java.lang.Thread.run(Thread.java:745)
> >
> >
> > 16/02/02 20:34:39 INFO DiskBlockManager: Shutdown hook called
> >
> >
> > 16/02/02 20:34:39 INFO Executor: Finished task 32.0 in stage 32.0 (TID
> > 3298). 2004728720 bytes result sent via BlockManager)
> >
> >
> > 16/02/02 20:34:39 ERROR SparkUncaughtExceptionHandler: Uncaught
> exception in
> > thread Thread[Executor task launch worker-8,5,main]
> >
> >
> > java.lang.OutOfMemoryError: Java heap space
> >
> >
> >at java.util.Arrays.copyOf(Arrays.java:2271)
> >
> >
> >at
> > java.io.ByteArrayOutputStream.toByteArray(ByteArrayOutputStream.java:191)
> >
> >
> >at
> >
> org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:86)
> >
> >
> >at
> > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:256)
> >
> >
> >at
> >
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> >
> >
> >at
> >
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> >
> >
> >at java.lang.Thread.run(Thread.java:745)
> >
> >
> > 16/02/02 20:34:39 INFO ShutdownHookManager: Shutdown hook called
> >
> >
> > 16/02/02 20:34:39 INFO MetricsSystemImpl: Stopping azure-file-system
> metrics
> > system...
> >
> >
> > 16/02/02 20:34:39 INFO MetricsSinkAdapter: azurefs2 thread interrupted.
> >
> >
> > 16/02/02 20:34:39 INFO MetricsSystemImpl: azure-file-system metrics
> system
> > stopped.
> >
> >
> > 16/02/02 20:34:39 INFO MetricsSystemImpl: azure-file-system metrics
> system
> > shutdown complete.
> >
> >
> >
> >
> >
> > And …..
> >
> >
> >
> >
> >
> > 16/02/02 20:09:03 INFO impl.ContainerManagementProtocolProxy: Opening
> proxy
> > : 10.0.0.5:30050
> >
> >
> > 16/02/02 20:33:51 INFO yarn.YarnAllocator: Completed container
> > container_1454421662639_0011_01_05 (state: COMPLETE, exit status:
> -104)
> >
> >
> > 16/02/02 20:33:51 WARN yarn.YarnAllocator: Container killed by YARN for
> > exceeding memory limits. 16.8 GB of 16.5 GB physical memory used.
> Consider
> > boosting spark.yarn.executor.memoryOverhead.
> >
> >
> > 16/02/02 20:33:56 INFO yarn.YarnAllocator: Will request 1 executor
> > containers, each with 2 cores and 16768 MB memory including 384 MB
> overhead
> >
> >
> > 16/02/02 20:33:56 INFO yarn.Y

RE: how to introduce spark to your colleague if he has no background about *** spark related

2016-02-02 Thread Mohammed Guller
Hi Charles,

You may find slides 16-20 from this deck useful:
http://www.slideshare.net/mg007/big-data-trends-challenges-opportunities-57744483

I used it for a talk that I gave to MS students last week. I wanted to give 
them some context before describing Spark.

It doesn’t cover all the stuff that you have on your agenda, but I would be 
happy to guide you. Feel free to send me a direct email.

Mohammed
Author: Big Data Analytics with 
Spark

From: Xiao Li [mailto:gatorsm...@gmail.com]
Sent: Sunday, January 31, 2016 10:41 PM
To: Jörn Franke
Cc: charles li; user
Subject: Re: how to introduce spark to your colleague if he has no background 
about *** spark related

My 2 cents. Concepts are always boring to the people with zero background. Use 
examples to show how easy and powerful Spark is! Use cases are also useful for 
them. Downloaded the slides in Spark summit. I believe you can find a lot of 
interesting ideas!

Tomorrow, I am facing similar issues, but the audiences are three RDBMS engine 
experts. I will go over the paper Spark SQL in Sigmod 2015 with them and show 
them the source codes.

Good luck!

Xiao Li

2016-01-31 22:35 GMT-08:00 Jörn Franke 
mailto:jornfra...@gmail.com>>:
It depends of course on the background of the people but how about some 
examples ("word count") how it works in the background.

On 01 Feb 2016, at 07:31, charles li 
mailto:charles.up...@gmail.com>> wrote:

Apache Spark™ is a fast and general engine for large-scale data processing.

it's a good profile of spark, but it's really too short for lots of people if 
then have little background in this field.

ok, frankly, I'll give a tech-talk about spark later this week, and now I'm 
writing a slide about that, but I'm stuck at the first slide.


I'm going to talk about three question about spark in the first part of my 
talk, for most of my colleagues has no background on spark, hadoop, so I want 
to talk :

1. the background of birth of spark
2. pros and cons of spark, or the situations that spark is going to handle, or 
why we use spark
3. the basic principles of spark,
4. the basic conceptions of spark

have anyone met kinds of this problem, introduce spark to one who has no 
background on your field? and I hope you can tell me how you handle this 
problem at that time, or give some ideas about the 4 sections mentioned above.


great thanks.


--
--
a spark lover, a quant, a developer and a good man.

http://github.com/litaotao
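
A minimal sketch of the kind of "word count" walk-through suggested earlier in
the thread (the input path is a placeholder); each step maps onto a piece of the
execution model, which makes it handy for explaining what happens in the
background:

val lines = sc.textFile("hdfs:///data/sample.txt")
val counts = lines
  .flatMap(_.split("\\s+"))          // split lines into words (narrow, per-partition)
  .map(word => (word, 1))            // key each word (still narrow)
  .reduceByKey(_ + _)                // shuffle: combine counts per word across partitions
counts.take(10).foreach(println)     // action that actually triggers the job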



recommendProductsForUser for a subset of user

2016-02-02 Thread Roberto Pagliari
When using ALS, is it possible to use recommendProductsForUser for a subset of 
users?

Currently, productFeatures and userFeatures are val. Is there a workaround for 
it? Using recommendForUser repeatedly would not work in my case, since it would 
be too slow with many users.


Thank you,
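
One possible workaround, sketched under the assumption that the product factor
matrix is small enough to collect and broadcast; the method and variable names
below are made up for illustration:

import org.apache.spark.mllib.recommendation.MatrixFactorizationModel
import org.apache.spark.rdd.RDD

def recommendForUserSubset(
    model: MatrixFactorizationModel,
    userIds: Set[Int],
    num: Int): RDD[(Int, Seq[(Int, Double)])] = {
  val sc = model.userFeatures.sparkContext
  // Ship the product factors to every executor (assumes they fit in memory).
  val productFactors = sc.broadcast(model.productFeatures.collect())
  model.userFeatures
    .filter { case (userId, _) => userIds.contains(userId) }   // keep only the subset
    .mapValues { userVec =>
      productFactors.value
        .map { case (productId, prodVec) =>
          // score = dot product of user and product factor vectors
          (productId, userVec.zip(prodVec).map { case (a, b) => a * b }.sum)
        }
        .sortBy(-_._2)
        .take(num)
        .toSeq
    }
}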



Re: Spark DataFrame Catalyst - Another Oracle like query optimizer?

2016-02-02 Thread Nirav Patel
So the latest optimizations in the Spark 1.4 and 1.5 releases are mostly from
project Tungsten. The docs say it uses sun.misc.Unsafe to convert the physical
RDD structure into byte arrays at some point for optimized GC and memory use. My
question is: why is this only applicable to SQL/DataFrames and not RDDs? RDDs have
types too!


On Mon, Jan 25, 2016 at 11:10 AM, Nirav Patel  wrote:

> I haven't gone through much details of spark catalyst optimizer and
> tungston project but we have been advised by databricks support to use
> DataFrame to resolve issues with OOM error that we are getting during Join
> and GroupBy operations. We use spark 1.3.1 and looks like it can not
> perform external sort and blows with OOM.
>
> https://forums.databricks.com/questions/2082/i-got-the-akka-frame-size-exceeded-exception.html
>
> Now it's great that it has been addressed in spark 1.5 release but why
> databricks advocating to switch to DataFrames? It may make sense for batch
> jobs or near real-time jobs but not sure if they do when you are developing
> real time analytics where you want to optimize every millisecond that you
> can. Again I am still knowledging myself with DataFrame APIs and
> optimizations and I will benchmark it against RDD for our batch and
> real-time use case as well.
>
> On Mon, Jan 25, 2016 at 9:47 AM, Mark Hamstra 
> wrote:
>
>> What do you think is preventing you from optimizing your own RDD-level
>> transformations and actions?  AFAIK, nothing that has been added in
>> Catalyst precludes you from doing that.  The fact of the matter is, though,
>> that there is less type and semantic information available to Spark from
>> the raw RDD API than from using Spark SQL, DataFrames or DataSets.  That
>> means that Spark itself can't optimize for raw RDDs the same way that it
>> can for higher-level constructs that can leverage Catalyst; but if you want
>> to write your own optimizations based on your own knowledge of the data
>> types and semantics that are hiding in your raw RDDs, there's no reason
>> that you can't do that.
>>
>> On Mon, Jan 25, 2016 at 9:35 AM, Nirav Patel 
>> wrote:
>>
>>> Hi,
>>>
>>> Perhaps I should write a blog about this that why spark is focusing more
>>> on writing easier spark jobs and hiding underlaying performance
>>> optimization details from a seasoned spark users. It's one thing to provide
>>> such abstract framework that does optimization for you so you don't have to
>>> worry about it as a data scientist or data analyst but what about
>>> developers who do not want overhead of SQL and Optimizers and unnecessary
>>> abstractions ! Application designer who knows their data and queries should
>>> be able to optimize at RDD level transformations and actions. Does spark
>>> provides a way to achieve same level of optimization by using either SQL
>>> Catalyst or raw RDD transformation?
>>>
>>> Thanks
>>>
>>>
>>>
>>>
>>>
>>> [image: What's New with Xactly] 
>>>
>>>   [image: LinkedIn]
>>>   [image: Twitter]
>>>   [image: Facebook]
>>>   [image: YouTube]
>>> 
>>
>>
>>
>




Spark 1.5.2 - are new Project Tungsten optimizations available on RDD as well?

2016-02-02 Thread Nirav Patel
Hi,

I read the release notes and a few slideshares on the latest optimizations in
the Spark 1.4 and 1.5 releases, part of which come from project Tungsten. The
docs say it uses sun.misc.Unsafe to convert the physical RDD structure into byte
arrays before shuffle for optimized GC and memory use. My question is: why is
this only applicable to SQL/DataFrames and not RDDs? RDDs have types too!

Thanks,
Nirav




Re: Spark DataFrame Catalyst - Another Oracle like query optimizer?

2016-02-02 Thread Nirav Patel
Sure, having a common distributed query and compute engine for all kinds of
data sources is an alluring concept to market and advertise and to attract
potential customers (non-engineers, analysts, data scientists). But it's
nothing new! ... rather old school, in fact; it takes bits and pieces from
existing SQL and NoSQL technology, and it lacks much of the panache of a robust
SQL engine. I think what sets Spark apart from everything else on the market is
the RDD, and the flexibility and Scala-like programming style it gives
developers, which is simply much more attractive to write than SQL syntax,
schemas and string constants that fall apart left and right. Writing SQL is old
school, period.  Good luck making money though :)

On Tue, Feb 2, 2016 at 4:38 PM, Koert Kuipers  wrote:

> To have a product databricks can charge for their sql engine needs to be
> competitive. That's why they have these optimizations in catalyst. RDD is
> simply no longer the focus.
> On Feb 2, 2016 7:17 PM, "Nirav Patel"  wrote:
>
>> so latest optimizations done on spark 1.4 and 1.5 releases are mostly
>> from project Tungsten. Docs says it usues sun.misc.unsafe to convert
>> physical rdd structure into byte array at some point for optimized GC and
>> memory. My question is why is it only applicable to SQL/Dataframe and not
>> RDD? RDD has types too!
>>
>>
>> On Mon, Jan 25, 2016 at 11:10 AM, Nirav Patel 
>> wrote:
>>
>>> I haven't gone through much details of spark catalyst optimizer and
>>> tungston project but we have been advised by databricks support to use
>>> DataFrame to resolve issues with OOM error that we are getting during Join
>>> and GroupBy operations. We use spark 1.3.1 and looks like it can not
>>> perform external sort and blows with OOM.
>>>
>>> https://forums.databricks.com/questions/2082/i-got-the-akka-frame-size-exceeded-exception.html
>>>
>>> Now it's great that it has been addressed in spark 1.5 release but why
>>> databricks advocating to switch to DataFrames? It may make sense for batch
>>> jobs or near real-time jobs but not sure if they do when you are developing
>>> real time analytics where you want to optimize every millisecond that you
>>> can. Again I am still knowledging myself with DataFrame APIs and
>>> optimizations and I will benchmark it against RDD for our batch and
>>> real-time use case as well.
>>>
>>> On Mon, Jan 25, 2016 at 9:47 AM, Mark Hamstra 
>>> wrote:
>>>
 What do you think is preventing you from optimizing your own RDD-level
 transformations and actions?  AFAIK, nothing that has been added in
 Catalyst precludes you from doing that.  The fact of the matter is, though,
 that there is less type and semantic information available to Spark from
 the raw RDD API than from using Spark SQL, DataFrames or DataSets.  That
 means that Spark itself can't optimize for raw RDDs the same way that it
 can for higher-level constructs that can leverage Catalyst; but if you want
 to write your own optimizations based on your own knowledge of the data
 types and semantics that are hiding in your raw RDDs, there's no reason
 that you can't do that.

 On Mon, Jan 25, 2016 at 9:35 AM, Nirav Patel 
 wrote:

> Hi,
>
> Perhaps I should write a blog about this that why spark is focusing
> more on writing easier spark jobs and hiding underlaying performance
> optimization details from a seasoned spark users. It's one thing to 
> provide
> such abstract framework that does optimization for you so you don't have 
> to
> worry about it as a data scientist or data analyst but what about
> developers who do not want overhead of SQL and Optimizers and unnecessary
> abstractions ! Application designer who knows their data and queries 
> should
> be able to optimize at RDD level transformations and actions. Does spark
> provides a way to achieve same level of optimization by using either SQL
> Catalyst or raw RDD transformation?
>
> Thanks
>
>
>
>
>
> [image: What's New with Xactly]
> 
>
>   [image: LinkedIn]
>   [image:
> Twitter]   [image: Facebook]
>   [image: YouTube]
> 



>>>
>>
>>
>>
>> [image: What's New with Xactly] 
>>
>>   [image: LinkedIn]
>>   [image: Twitter]
>>   [image: Facebook]
>>   [image: YouTube]
>> 
>
>



Best way to process large number of (non-text) files in deeply nested folder hierarchy

2016-02-02 Thread Boris Capitanu
Hello,

I find myself in need of being able to process a large number of files (28M) 
stored in a deeply nested folder hierarchy (Pairtree... a multi-level 
hashtable-on-disk -like structure). Here's an example path: 

./udel/pairtree_root/31/74/11/11/56/89/39/3174568939/3174568939.zip

I can't scan the entire folder structure ahead of time to build a list of 
files, and then use sc.parallelize(list, ...) to create an RDD to process, 
because traversing the entire folder structure would take a very long time. 
(Also, the folder content is fluid, meaning that files are added and deleted 
periodically.)  I'm thinking I would need to use Spark Streaming, whereby I use 
something like java.nio.file.Files.walkFileTree(...) to "discover" these files and 
"stream" them into Spark as they're found.

What do the experts think?  Is there a better way of handling this?
The unit of parallelization is a single file (i.e. the "processing" operates at 
a file level).

Here's what I've created so far:

class DirectoryExplorerReceiver(dir: String, accept: (String, 
BasicFileAttributes) => Boolean) 
  extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {
  
  override def onStart(): Unit = {
val context = scala.concurrent.ExecutionContext.Implicits.global
val rootPath = FileSystems.getDefault.getPath(dir)

context.execute(new Runnable {
  override def run(): Unit =
Files.walkFileTree(rootPath, new SimpleFileVisitor[Path] {
  override def visitFile(file: Path, attrs: BasicFileAttributes): 
FileVisitResult = {
if (isStopped()) 
  return FileVisitResult.TERMINATE

if (accept(file.toString, attrs))
  store(file.toFile.getPath)

FileVisitResult.CONTINUE
  }
})
})
  }

  override def onStop(): Unit = {}
}

And use it like this:

val ssc = new StreamingContext(conf, Seconds(10))
val zipFiles = ssc.receiverStream(
  new DirectoryExplorerReceiver(
dataDir,
(filePath, attrs) =>
attrs.isRegularFile &&
filePath.toLowerCase.endsWith(".zip") &&
attrs.size() > 0
  )
)

val processedData = zipFiles.map(zipFile => doSomethingUseful(zipFile))

This basically uses a 10-second batch interval to "stream" all files discovered 
during that interval.

Thanks for any comments/suggestions.

-Boris



Re: Spark DataFrame Catalyst - Another Oracle like query optimizer?

2016-02-02 Thread Koert Kuipers
Dataset will have access to some of the catalyst/tungsten optimizations
while also giving you scala and types. However that is currently
experimental and not yet as efficient as it could be.
On Feb 2, 2016 7:50 PM, "Nirav Patel"  wrote:

> Sure, having a common distributed query and compute engine for all kind of
> data source is alluring concept to market and advertise and to attract
> potential customers (non engineers, analyst, data scientist). But it's
> nothing new!..but darn old school. it's taking bits and pieces from
> existing sql and no-sql technology. It lacks many panache of robust sql
> engine. I think what put spark aside from everything else on market is RDD!
> and flexibility and scala-like programming style given to developers which
> is simply much more attractive to write then sql syntaxes, schema and
> string constants that falls apart left and right. Writing sql is old
> school. period.  good luck making money though :)
>
> On Tue, Feb 2, 2016 at 4:38 PM, Koert Kuipers  wrote:
>
>> To have a product databricks can charge for their sql engine needs to be
>> competitive. That's why they have these optimizations in catalyst. RDD is
>> simply no longer the focus.
>> On Feb 2, 2016 7:17 PM, "Nirav Patel"  wrote:
>>
>>> so latest optimizations done on spark 1.4 and 1.5 releases are mostly
>>> from project Tungsten. Docs says it usues sun.misc.unsafe to convert
>>> physical rdd structure into byte array at some point for optimized GC and
>>> memory. My question is why is it only applicable to SQL/Dataframe and not
>>> RDD? RDD has types too!
>>>
>>>
>>> On Mon, Jan 25, 2016 at 11:10 AM, Nirav Patel 
>>> wrote:
>>>
 I haven't gone through much details of spark catalyst optimizer and
 tungston project but we have been advised by databricks support to use
 DataFrame to resolve issues with OOM error that we are getting during Join
 and GroupBy operations. We use spark 1.3.1 and looks like it can not
 perform external sort and blows with OOM.

 https://forums.databricks.com/questions/2082/i-got-the-akka-frame-size-exceeded-exception.html

 Now it's great that it has been addressed in spark 1.5 release but why
 databricks advocating to switch to DataFrames? It may make sense for batch
 jobs or near real-time jobs but not sure if they do when you are developing
 real time analytics where you want to optimize every millisecond that you
 can. Again I am still knowledging myself with DataFrame APIs and
 optimizations and I will benchmark it against RDD for our batch and
 real-time use case as well.

 On Mon, Jan 25, 2016 at 9:47 AM, Mark Hamstra 
 wrote:

> What do you think is preventing you from optimizing your own RDD-level
> transformations and actions?  AFAIK, nothing that has been added in
> Catalyst precludes you from doing that.  The fact of the matter is, 
> though,
> that there is less type and semantic information available to Spark from
> the raw RDD API than from using Spark SQL, DataFrames or DataSets.  That
> means that Spark itself can't optimize for raw RDDs the same way that it
> can for higher-level constructs that can leverage Catalyst; but if you 
> want
> to write your own optimizations based on your own knowledge of the data
> types and semantics that are hiding in your raw RDDs, there's no reason
> that you can't do that.
>
> On Mon, Jan 25, 2016 at 9:35 AM, Nirav Patel 
> wrote:
>
>> Hi,
>>
>> Perhaps I should write a blog about this that why spark is focusing
>> more on writing easier spark jobs and hiding underlaying performance
>> optimization details from a seasoned spark users. It's one thing to 
>> provide
>> such abstract framework that does optimization for you so you don't have 
>> to
>> worry about it as a data scientist or data analyst but what about
>> developers who do not want overhead of SQL and Optimizers and unnecessary
>> abstractions ! Application designer who knows their data and queries 
>> should
>> be able to optimize at RDD level transformations and actions. Does spark
>> provides a way to achieve same level of optimization by using either SQL
>> Catalyst or raw RDD transformation?
>>
>> Thanks
>>
>>
>>
>>
>>
>> [image: What's New with Xactly]
>> 
>>
>>   [image: LinkedIn]
>>   [image:
>> Twitter]   [image: Facebook]
>>   [image: YouTube]
>> 
>
>
>

>>>
>>>
>>>
>>> [image: What's New with Xactly] 
>>>
>>>   [image: LinkedIn]
>>> 

Re: Spark DataFrame Catalyst - Another Oracle like query optimizer?

2016-02-02 Thread Jakob Odersky
To address one specific question:

> Docs says it usues sun.misc.unsafe to convert physical rdd structure into
byte array at some point for optimized GC and memory. My question is why is
it only applicable to SQL/Dataframe and not RDD? RDD has types too!

A principal difference between RDDs and DataFrames/Datasets is that the
latter have a schema associated with them. This means that they support only
certain types (primitives, case classes and more) and that they are
uniform, whereas RDDs can contain any serializable object and need not
be uniform. These properties make it possible to generate very
efficient serialization and other optimizations that cannot be achieved
with plain RDDs.
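
A small sketch of that point, assuming an existing SparkContext named sc (the
case class and its fields are invented for illustration):

import org.apache.spark.sql.SQLContext

case class Sale(id: Long, region: String, amount: Double)

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// The RDD may hold any serializable object; once converted to a DataFrame,
// Catalyst knows the exact, uniform schema (id: long, region: string, amount: double).
val salesRdd = sc.parallelize(Seq(Sale(1L, "EU", 10.5), Sale(2L, "US", 7.0)))
val salesDF  = salesRdd.toDF()
salesDF.printSchema()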


Re: Spark DataFrame Catalyst - Another Oracle like query optimizer?

2016-02-02 Thread Michael Armbrust
>
> A principal difference between RDDs and DataFrames/Datasets is that the
> latter have a schema associated to them. This means that they support only
> certain types (primitives, case classes and more) and that they are
> uniform, whereas RDDs can contain any serializable object and must not
> necessarily be uniform. These properties make it possible to generate very
> efficient serialization and other optimizations that cannot be achieved
> with plain RDDs.
>

You can use Encoder.kryo() as well to serialize arbitrary objects, just
like with RDDs.
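
A sketch of that route using the Spark 1.6 Dataset API; the factory lives on
org.apache.spark.sql.Encoders, and the class below is made up for illustration:

import org.apache.spark.sql.{Encoder, Encoders}

// An arbitrary serializable class that is not a case class and is not
// natively supported by Catalyst.
class LegacyRecord(val id: Int, val payload: Array[Byte]) extends Serializable

// With a kryo encoder the Dataset stores objects as opaque binary blobs,
// much like an RDD would, so schema-based optimizations largely do not apply.
implicit val legacyEncoder: Encoder[LegacyRecord] = Encoders.kryo[LegacyRecord]

// Assuming an existing SQLContext on Spark 1.6:
// val ds = sqlContext.createDataset(Seq(new LegacyRecord(1, Array[Byte](1, 2, 3))))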


Re: Spark DataFrame Catalyst - Another Oracle like query optimizer?

2016-02-02 Thread Jerry Lam
I think the Spark DataFrame supports more than just SQL; it is more like a pandas
dataframe. (I rarely use the SQL feature.)
There are a lot of novelties in DataFrame, so I think it is quite optimized
for many tasks. The in-memory data structure is very memory efficient. I
just changed a very slow RDD program to use DataFrame; the performance gain
is about 2 times while using less CPU. Of course, if you are very good at
optimizing your code, then use pure RDDs.


On Tue, Feb 2, 2016 at 8:08 PM, Koert Kuipers  wrote:

> Dataset will have access to some of the catalyst/tungsten optimizations
> while also giving you scala and types. However that is currently
> experimental and not yet as efficient as it could be.
> On Feb 2, 2016 7:50 PM, "Nirav Patel"  wrote:
>
>> Sure, having a common distributed query and compute engine for all kind
>> of data source is alluring concept to market and advertise and to attract
>> potential customers (non engineers, analyst, data scientist). But it's
>> nothing new!..but darn old school. it's taking bits and pieces from
>> existing sql and no-sql technology. It lacks many panache of robust sql
>> engine. I think what put spark aside from everything else on market is RDD!
>> and flexibility and scala-like programming style given to developers which
>> is simply much more attractive to write then sql syntaxes, schema and
>> string constants that falls apart left and right. Writing sql is old
>> school. period.  good luck making money though :)
>>
>> On Tue, Feb 2, 2016 at 4:38 PM, Koert Kuipers  wrote:
>>
>>> To have a product databricks can charge for their sql engine needs to be
>>> competitive. That's why they have these optimizations in catalyst. RDD is
>>> simply no longer the focus.
>>> On Feb 2, 2016 7:17 PM, "Nirav Patel"  wrote:
>>>
 so latest optimizations done on spark 1.4 and 1.5 releases are mostly
 from project Tungsten. Docs says it usues sun.misc.unsafe to convert
 physical rdd structure into byte array at some point for optimized GC and
 memory. My question is why is it only applicable to SQL/Dataframe and not
 RDD? RDD has types too!


 On Mon, Jan 25, 2016 at 11:10 AM, Nirav Patel 
 wrote:

> I haven't gone through much details of spark catalyst optimizer and
> tungston project but we have been advised by databricks support to use
> DataFrame to resolve issues with OOM error that we are getting during Join
> and GroupBy operations. We use spark 1.3.1 and looks like it can not
> perform external sort and blows with OOM.
>
> https://forums.databricks.com/questions/2082/i-got-the-akka-frame-size-exceeded-exception.html
>
> Now it's great that it has been addressed in spark 1.5 release but why
> databricks advocating to switch to DataFrames? It may make sense for batch
> jobs or near real-time jobs but not sure if they do when you are 
> developing
> real time analytics where you want to optimize every millisecond that you
> can. Again I am still knowledging myself with DataFrame APIs and
> optimizations and I will benchmark it against RDD for our batch and
> real-time use case as well.
>
> On Mon, Jan 25, 2016 at 9:47 AM, Mark Hamstra  > wrote:
>
>> What do you think is preventing you from optimizing your
>> own RDD-level transformations and actions?  AFAIK, nothing that has been
>> added in Catalyst precludes you from doing that.  The fact of the matter
>> is, though, that there is less type and semantic information available to
>> Spark from the raw RDD API than from using Spark SQL, DataFrames or
>> DataSets.  That means that Spark itself can't optimize for raw RDDs the
>> same way that it can for higher-level constructs that can leverage
>> Catalyst; but if you want to write your own optimizations based on your 
>> own
>> knowledge of the data types and semantics that are hiding in your raw 
>> RDDs,
>> there's no reason that you can't do that.
>>
>> On Mon, Jan 25, 2016 at 9:35 AM, Nirav Patel 
>> wrote:
>>
>>> Hi,
>>>
>>> Perhaps I should write a blog about this that why spark is focusing
>>> more on writing easier spark jobs and hiding underlaying performance
>>> optimization details from a seasoned spark users. It's one thing to 
>>> provide
>>> such abstract framework that does optimization for you so you don't 
>>> have to
>>> worry about it as a data scientist or data analyst but what about
>>> developers who do not want overhead of SQL and Optimizers and 
>>> unnecessary
>>> abstractions ! Application designer who knows their data and queries 
>>> should
>>> be able to optimize at RDD level transformations and actions. Does spark
>>> provides a way to achieve same level of optimization by using either SQL
>>> Catalyst or raw RDD transformation?
>>>
>>> Thanks
>>>
>>>
>>>
>>>
>>>
>>> [

Re: Spark DataFrame Catalyst - Another Oracle like query optimizer?

2016-02-02 Thread Jerry Lam
Hi Michael,

Is there a section in the Spark documentation that demonstrates how to serialize
arbitrary objects in a DataFrame? The last time I did it was using a User
Defined Type (copied from VectorUDT).

Best Regards,

Jerry

On Tue, Feb 2, 2016 at 8:46 PM, Michael Armbrust 
wrote:

> A principal difference between RDDs and DataFrames/Datasets is that the
>> latter have a schema associated to them. This means that they support only
>> certain types (primitives, case classes and more) and that they are
>> uniform, whereas RDDs can contain any serializable object and must not
>> necessarily be uniform. These properties make it possible to generate very
>> efficient serialization and other optimizations that cannot be achieved
>> with plain RDDs.
>>
>
> You can use Encoder.kryo() as well to serialize arbitrary objects, just
> like with RDDs.
>


Re: Spark DataFrame Catalyst - Another Oracle like query optimizer?

2016-02-02 Thread Nirav Patel
I don't understand why one thinks an RDD of case objects doesn't have
types (a schema). If Spark can convert an RDD to a DataFrame, that means it
understood the schema. So from that point on, why does one have to use SQL
features to do further processing? If all Spark needs for optimizations is the
schema, then what do these additional SQL features buy? If there is a way to
avoid the SQL features while using DataFrames I don't mind it. But it looks like
I have to convert all my existing transformations to things like
df1.join(df2, df1('abc') == df2('abc'), 'left_outer') .. that's plain ugly
and error prone in my opinion.
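
For concreteness, the two styles being contrasted look roughly like this (a
sketch; the case classes, fields and join key are assumptions, and an existing
SparkContext sc is assumed):

import org.apache.spark.sql.SQLContext

case class Order(customerId: String, amount: Double)
case class Customer(customerId: String, region: String)

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

val orders    = sc.parallelize(Seq(Order("c1", 10.0), Order("c2", 5.0)))
val customers = sc.parallelize(Seq(Customer("c1", "EU")))

// RDD style: typed fields, keyed and joined explicitly.
val joinedRdd = orders.keyBy(_.customerId).leftOuterJoin(customers.keyBy(_.customerId))

// DataFrame style: columns are referenced by name as strings; in Scala it is
// df("col") === df("col") rather than the Python df('col') == df('col') form.
val ordersDF    = orders.toDF()
val customersDF = customers.toDF()
val joinedDF = ordersDF.join(customersDF,
  ordersDF("customerId") === customersDF("customerId"), "left_outer")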

On Tue, Feb 2, 2016 at 5:49 PM, Jerry Lam  wrote:

> Hi Michael,
>
> Is there a section in the spark documentation demonstrate how to serialize
> arbitrary objects in Dataframe? The last time I did was using some User
> Defined Type (copy from VectorUDT).
>
> Best Regards,
>
> Jerry
>
> On Tue, Feb 2, 2016 at 8:46 PM, Michael Armbrust 
> wrote:
>
>> A principal difference between RDDs and DataFrames/Datasets is that the
>>> latter have a schema associated to them. This means that they support only
>>> certain types (primitives, case classes and more) and that they are
>>> uniform, whereas RDDs can contain any serializable object and must not
>>> necessarily be uniform. These properties make it possible to generate very
>>> efficient serialization and other optimizations that cannot be achieved
>>> with plain RDDs.
>>>
>>
>> You can use Encoder.kryo() as well to serialize arbitrary objects, just
>> like with RDDs.
>>
>
>




Re: Spark DataFrame Catalyst - Another Oracle like query optimizer?

2016-02-02 Thread Jerry Lam
Hi Nirav,
I'm sure you read this?
https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html

There is a benchmark in the article showing that a dataframe "can" outperform
an RDD implementation by 2 times. Of course, benchmarks can be "made". But
from the code snippet you wrote, I "think" the dataframe will choose between
different join implementations based on the data statistics.

I cannot comment on the beauty of it because "beauty is in the eye of the
beholder" LOL
Regarding the comment about being error prone, can you say why you think that is
the case? Relative to what other approaches?

Best Regards,

Jerry


On Tue, Feb 2, 2016 at 8:59 PM, Nirav Patel  wrote:

> I dont understand why one thinks RDD of case object doesn't have
> types(schema) ? If spark can convert RDD to DataFrame which means it
> understood the schema. SO then from that point why one has to use SQL
> features to do further processing? If all spark need for optimizations is
> schema then what this additional SQL features buys ? If there is a way to
> avoid SQL feature using DataFrame I don't mind it. But looks like I have to
> convert all my existing transformation to things like
> df1.join(df2,df1('abc') == df2('abc'), 'left_outer') .. that's plain ugly
> and error prone in my opinion.
>
> On Tue, Feb 2, 2016 at 5:49 PM, Jerry Lam  wrote:
>
>> Hi Michael,
>>
>> Is there a section in the spark documentation demonstrate how to
>> serialize arbitrary objects in Dataframe? The last time I did was using
>> some User Defined Type (copy from VectorUDT).
>>
>> Best Regards,
>>
>> Jerry
>>
>> On Tue, Feb 2, 2016 at 8:46 PM, Michael Armbrust 
>> wrote:
>>
>>> A principal difference between RDDs and DataFrames/Datasets is that the
 latter have a schema associated to them. This means that they support only
 certain types (primitives, case classes and more) and that they are
 uniform, whereas RDDs can contain any serializable object and must not
 necessarily be uniform. These properties make it possible to generate very
 efficient serialization and other optimizations that cannot be achieved
 with plain RDDs.

>>>
>>> You can use Encoder.kryo() as well to serialize arbitrary objects, just
>>> like with RDDs.
>>>
>>
>>
>
>
>
> [image: What's New with Xactly] 
>
>   [image: LinkedIn]
>   [image: Twitter]
>   [image: Facebook]
>   [image: YouTube]
> 
>


Dynamic sql in Spark 1.5

2016-02-02 Thread Divya Gehlot
Hi,
Does Spark support dynamic SQL?
I would really appreciate the help if anyone could share some
references/examples.



Thanks,
Divya


question on spark.streaming.kafka.maxRetries

2016-02-02 Thread Chen Song
For Kafka direct stream, is there a way to set the time between successive
retries? From my testing, it looks like it is 200ms. Any way I can increase
the time?
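
Speculative sketch, offered as an assumption to verify rather than a documented
contract: the observed 200 ms matches Kafka's refresh.leader.backoff.ms default,
which the direct stream sleeps on between retries, so raising it in kafkaParams
may lengthen the gap between successive retries:

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

val kafkaParams = Map(
  "metadata.broker.list"      -> "broker1:9092",   // assumed broker address
  "refresh.leader.backoff.ms" -> "1000"            // back off 1 s instead of the 200 ms default
)
// val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
//   ssc, kafkaParams, Set("my-topic"))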


Re: Dynamic sql in Spark 1.5

2016-02-02 Thread Ali Tajeldin EDU
While you can construct the SQL string dynamically in scala/java/python, it 
would be best to use the Dataframe API for creating dynamic SQL queries.  See 
http://spark.apache.org/docs/1.5.2/sql-programming-guide.html for details.
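
A sketch of building a query dynamically with the DataFrame API rather than
concatenating SQL strings; the table name, columns and filter map are
placeholders:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Fold an arbitrary map of column -> value predicates onto a base DataFrame.
def applyFilters(df: DataFrame, filters: Map[String, Any]): DataFrame =
  filters.foldLeft(df) { case (acc, (column, value)) =>
    acc.filter(col(column) === value)
  }

// val base   = sqlContext.table("events")
// val result = applyFilters(base, Map("country" -> "SG", "year" -> 2016))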

On Feb 2, 2016, at 6:49 PM, Divya Gehlot  wrote:

> Hi,
> Does Spark supports dyamic sql ?
> Would really appreciate the help , if any one could share some 
> references/examples.
> 
> 
> 
> Thanks,
> Divya 



Re: Master failover results in running job marked as "WAITING"

2016-02-02 Thread Anthony Tang
Yes, it's the IP address/host.


Sent from Yahoo Mail for iPad


On Tuesday, February 2, 2016, 8:04 AM, Ted Yu  wrote:

bq. Failed to connect to master XXX:7077
Is the 'XXX' above the hostname for the new master ?
Thanks
On Tue, Feb 2, 2016 at 1:48 AM, Anthony Tang  wrote:

Hi - 

I'm running Spark 1.5.2 in standalone mode with multiple masters using 
zookeeper for failover.  The master fails over correctly to the standby when it 
goes down, and running applications seem to continue to run, but in the new 
active master web UI, they are marked as "WAITING", and the workers have these 
entries in their logs: 

16/01/30 00:51:13 ERROR Worker: Connection to master failed! Waiting for master 
to reconnect... 
16/01/30 00:51:13 WARN Worker: Failed to connect to master XXX:7077 
akka.actor.ActorNotFound: Actor not found for: 
ActorSelection[Anchor(akka.tcp://sparkMaster@XXX:7077/), Path(/user/Master)] 

Should they be "RUNNING" still? One time, it looked like the job stopped 
functioning (This is a continuously running streaming job), but I haven't been 
able to reproduce it.  FWIW, the driver that started it is still marked as 
"RUNNING". 

Thanks. 
- Anthony



 



Union of RDDs without the overhead of Union

2016-02-02 Thread Jerry Lam
Hi Spark users and developers,

Does anyone know how to union two RDDs without the overhead of it?

Say rdd1.union(rdd2).saveAsTextFile(...).
This requires a stage to union the 2 RDDs before saveAsTextFile (2 stages).
Is there a way to skip the union step but have the contents of the two RDDs
saved to the same output text file?

Thank you!

Jerry


Re: Union of RDDs without the overhead of Union

2016-02-02 Thread Koert Kuipers
well the "hadoop" way is to save to a/b and a/c and read from a/* :)

On Tue, Feb 2, 2016 at 11:05 PM, Jerry Lam  wrote:

> Hi Spark users and developers,
>
> anyone knows how to union two RDDs without the overhead of it?
>
> say rdd1.union(rdd2).saveTextFile(..)
> This requires a stage to union the 2 rdds before saveAsTextFile (2
> stages). Is there a way to skip the union step but have the contents of the
> two rdds save to the same output text file?
>
> Thank you!
>
> Jerry
>


Re: Union of RDDs without the overhead of Union

2016-02-02 Thread Koert Kuipers
i am surprised union introduces a stage. UnionRDD should have only narrow
dependencies.

On Tue, Feb 2, 2016 at 11:25 PM, Koert Kuipers  wrote:

> well the "hadoop" way is to save to a/b and a/c and read from a/* :)
>
> On Tue, Feb 2, 2016 at 11:05 PM, Jerry Lam  wrote:
>
>> Hi Spark users and developers,
>>
>> anyone knows how to union two RDDs without the overhead of it?
>>
>> say rdd1.union(rdd2).saveTextFile(..)
>> This requires a stage to union the 2 rdds before saveAsTextFile (2
>> stages). Is there a way to skip the union step but have the contents of the
>> two rdds save to the same output text file?
>>
>> Thank you!
>>
>> Jerry
>>
>
>


Overriding toString and hashCode with Spark streaming

2016-02-02 Thread N B
Hello,

In our Spark Streaming application, we are forming DStreams made of objects of
a rather large composite class. I have discovered that operations like
RDD.subtract() are only successful for complex objects such as these after
overriding the toString() and hashCode() methods of this class.

Are there any issues that we should be aware of in general for doing so in
a Spark program?

Thanks
NB
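
A sketch of the pattern being described; note that set-like RDD operations such
as subtract() compare elements via equals() and hashCode(), so those two are the
ones that must stay consistent (the field names below are invented, and a case
class would generate all of this automatically):

class CompositeEvent(val userId: String, val eventType: String, val ts: Long)
    extends Serializable {

  override def equals(other: Any): Boolean = other match {
    case that: CompositeEvent =>
      userId == that.userId && eventType == that.eventType && ts == that.ts
    case _ => false
  }

  // hashCode must agree with equals, since subtract() hash-partitions elements.
  override def hashCode(): Int = (userId, eventType, ts).hashCode()

  override def toString: String = s"CompositeEvent($userId,$eventType,$ts)"
}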


Re: Spark DataFrame Catalyst - Another Oracle like query optimizer?

2016-02-02 Thread Nirav Patel
Hi Jerry,

Yes, I read that benchmark, and it doesn't help in most cases. I'll give you an
example from one of our applications. It's a memory hogger by nature since it
works on groupByKey and performs combinatorics on the Iterator, so it maintains
a few structures inside each task. It runs on MapReduce with half the resources I
am giving it for Spark, and Spark keeps throwing OOM on a pre-step which is
a simple join! I saw every task was done at process_local locality, yet the
join keeps failing because the container gets killed, and the container gets
killed due to OOM. We have a case open with Databricks/MapR on that for more than
a month; anyway, I don't wanna distract from that here. I can believe that
changing to DataFrame and its computing model can bring performance, but I was
hoping that wouldn't be your answer to every performance problem.

Let me ask this - if I decide to stick with RDDs, do I still have the
flexibility to choose which join implementation to use, and similar underlying
constructs to best execute my jobs?

I said error prone because you need to write column qualifiers instead of
referencing fields, i.e. 'caseObj("field1")' instead of 'caseObj.field1';
moreover, multiple tables having similar column names causes parsing
issues; and when you start writing constants for your columns it just
becomes yet another piece of schema maintenance inside your app. It feels like a
thing of the past. A query engine (distributed or not) is old school as I 'see' it :)

Thanks for being patient.
Nirav





On Tue, Feb 2, 2016 at 6:26 PM, Jerry Lam  wrote:

> Hi Nirav,
> I'm sure you read this?
> https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html
>
> There is a benchmark in the article to show that dataframe "can"
> outperform RDD implementation by 2 times. Of course, benchmarks can be
> "made". But from the code snippet you wrote, I "think" dataframe will
> choose between different join implementation based on the data statistics.
>
> I cannot comment on the beauty of it because "beauty is in the eye of the
> beholder" LOL
> Regarding the comment on error prone, can you say why you think it is the
> case? Relative to what other ways?
>
> Best Regards,
>
> Jerry
>
>
> On Tue, Feb 2, 2016 at 8:59 PM, Nirav Patel  wrote:
>
>> I dont understand why one thinks RDD of case object doesn't have
>> types(schema) ? If spark can convert RDD to DataFrame which means it
>> understood the schema. SO then from that point why one has to use SQL
>> features to do further processing? If all spark need for optimizations is
>> schema then what this additional SQL features buys ? If there is a way to
>> avoid SQL feature using DataFrame I don't mind it. But looks like I have to
>> convert all my existing transformation to things like
>> df1.join(df2,df1('abc') == df2('abc'), 'left_outer') .. that's plain ugly
>> and error prone in my opinion.
>>
>> On Tue, Feb 2, 2016 at 5:49 PM, Jerry Lam  wrote:
>>
>>> Hi Michael,
>>>
>>> Is there a section in the spark documentation demonstrate how to
>>> serialize arbitrary objects in Dataframe? The last time I did was using
>>> some User Defined Type (copy from VectorUDT).
>>>
>>> Best Regards,
>>>
>>> Jerry
>>>
>>> On Tue, Feb 2, 2016 at 8:46 PM, Michael Armbrust >> > wrote:
>>>
 A principal difference between RDDs and DataFrames/Datasets is that the
> latter have a schema associated to them. This means that they support only
> certain types (primitives, case classes and more) and that they are
> uniform, whereas RDDs can contain any serializable object and must not
> necessarily be uniform. These properties make it possible to generate very
> efficient serialization and other optimizations that cannot be achieved
> with plain RDDs.
>

 You can use Encoder.kryo() as well to serialize arbitrary objects, just
 like with RDDs.

>>>
>>>
>>
>>
>>
>> [image: What's New with Xactly] 
>>
>>   [image: LinkedIn]
>>   [image: Twitter]
>>   [image: Facebook]
>>   [image: YouTube]
>> 
>>
>
>




Re: Union of RDDs without the overhead of Union

2016-02-02 Thread Rishi Mishra
Agree with Koert that UnionRDD should have narrow dependencies, although the
union of two RDDs increases the number of tasks to be executed
(rdd1.partitions + rdd2.partitions).
If your two RDDs have the same number of partitions, you can also use
zipPartitions, which results in fewer tasks and hence less overhead.
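
A minimal sketch of the zipPartitions alternative, assuming the two RDDs already
have the same number of partitions (a zipPartitions requirement); names are
placeholders:

import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

def concatPartitionwise[T: ClassTag](rdd1: RDD[T], rdd2: RDD[T]): RDD[T] =
  rdd1.zipPartitions(rdd2) { (left, right) =>
    // concatenate the two partition iterators; no shuffle is involved
    left ++ right
  }

// concatPartitionwise(rdd1, rdd2).saveAsTextFile("out")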

On Wed, Feb 3, 2016 at 9:58 AM, Koert Kuipers  wrote:

> i am surprised union introduces a stage. UnionRDD should have only narrow
> dependencies.
>
> On Tue, Feb 2, 2016 at 11:25 PM, Koert Kuipers  wrote:
>
>> well the "hadoop" way is to save to a/b and a/c and read from a/* :)
>>
>> On Tue, Feb 2, 2016 at 11:05 PM, Jerry Lam  wrote:
>>
>>> Hi Spark users and developers,
>>>
>>> anyone knows how to union two RDDs without the overhead of it?
>>>
>>> say rdd1.union(rdd2).saveTextFile(..)
>>> This requires a stage to union the 2 rdds before saveAsTextFile (2
>>> stages). Is there a way to skip the union step but have the contents of the
>>> two rdds save to the same output text file?
>>>
>>> Thank you!
>>>
>>> Jerry
>>>
>>
>>
>


-- 
Regards,
Rishitesh Mishra,
SnappyData . (http://www.snappydata.io/)

https://in.linkedin.com/in/rishiteshmishra


Re: Spark DataFrame Catalyst - Another Oracle like query optimizer?

2016-02-02 Thread Koert Kuipers
with respect to joins, unfortunately not all implementations are available.
for example i would like to use joins where one side is streaming (and the
other cached). this seems to be available for DataFrame but not for RDD.
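
For reference, a sketch of the DataFrame-side capability being referred to: the
broadcast hint (in org.apache.spark.sql.functions since 1.5) requests a broadcast
hash join, where the small cached side is shipped to every task and the large
side streams past it; the DataFrame and column names are placeholders:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.broadcast

// largeDF and smallDF are assumed to be existing DataFrames sharing a "key" column.
def broadcastJoin(largeDF: DataFrame, smallDF: DataFrame): DataFrame =
  largeDF.join(broadcast(smallDF), largeDF("key") === smallDF("key"))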

On Wed, Feb 3, 2016 at 12:19 AM, Nirav Patel  wrote:

> Hi Jerry,
>
> Yes, I read that benchmark, and it doesn't help in most cases. I'll give
> you an example from one of our applications. It's a memory hogger by
> nature since it works on groupByKey and performs combinatorics on the
> resulting Iterator, so it maintains a few structures inside each task. It
> runs on MapReduce with half the resources I am giving it on Spark, yet
> Spark keeps throwing OOM on a pre-step which is a simple join! I saw that
> every task ran at process_local locality, yet the join keeps failing
> because the container gets killed, and the container gets killed due to
> OOM. We have had a case open with Databricks/MapR on that for more than a
> month. Anyway, I don't want to digress. I can believe that switching to
> DataFrames and their computing model can bring performance gains, but I
> was hoping that wouldn't be your answer to every performance problem.
>
> Let me ask this: if I decide to stick with RDDs, do I still have the
> flexibility to choose which join implementation is used, and similar
> underlying constructs to best execute my jobs?
>
> I said error prone because you need to write column qualifiers instead of
> referencing fields, i.e. 'caseObj("field1")' instead of 'caseObj.field1';
> moreover, multiple tables having similar column names cause parsing
> issues; and when you start writing constants for your columns it just
> becomes another piece of schema maintenance inside your app. It feels
> like a thing of the past. A query engine (distributed or not) is old
> school as I 'see' it :)
>
> Thanks for being patient.
> Nirav
>
>
>
>
>
> On Tue, Feb 2, 2016 at 6:26 PM, Jerry Lam  wrote:
>
>> Hi Nirav,
>> I'm sure you read this?
>> https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html
>>
>> There is a benchmark in the article showing that DataFrames "can"
>> outperform the RDD implementation by 2 times. Of course, benchmarks can be
>> "made". But from the code snippet you wrote, I "think" the DataFrame will
>> choose between different join implementations based on the data statistics.
>>
>> I cannot comment on the beauty of it because "beauty is in the eye of the
>> beholder" LOL
>> Regarding the comment about being error prone, can you say why you think
>> that is the case? Relative to what other approaches?
>>
>> Best Regards,
>>
>> Jerry
>>
>>
>> On Tue, Feb 2, 2016 at 8:59 PM, Nirav Patel 
>> wrote:
>>
>>> I don't understand why one thinks an RDD of case objects doesn't have
>>> types (a schema). If Spark can convert an RDD to a DataFrame, that means
>>> it understood the schema. So, from that point on, why does one have to
>>> use SQL features to do further processing? If all Spark needs for
>>> optimizations is the schema, then what do these additional SQL features
>>> buy? If there is a way to avoid the SQL features while using DataFrames,
>>> I don't mind it. But it looks like I have to convert all my existing
>>> transformations to things like
>>> df1.join(df2, df1("abc") === df2("abc"), "left_outer") .. that's plain
>>> ugly and error prone in my opinion.
>>>
>>> On Tue, Feb 2, 2016 at 5:49 PM, Jerry Lam  wrote:
>>>
 Hi Michael,

 Is there a section in the Spark documentation that demonstrates how to
 serialize arbitrary objects in a DataFrame? The last time I did it, I used
 a User Defined Type (copied from VectorUDT).

 Best Regards,

 Jerry

 On Tue, Feb 2, 2016 at 8:46 PM, Michael Armbrust <
 mich...@databricks.com> wrote:

>> A principal difference between RDDs and DataFrames/Datasets is that
>> the latter have a schema associated with them. This means that they
>> support only certain types (primitives, case classes and more) and that
>> they are uniform, whereas RDDs can contain any serializable object and
>> need not be uniform. These properties make it possible to generate very
>> efficient serialization and other optimizations that cannot be achieved
>> with plain RDDs.
>>
>
> You can use Encoders.kryo() as well to serialize arbitrary objects,
> just like with RDDs.
>


>>>
>>>
>>>
>>
>>
>
>
>

make-distribution fails due to wrong order of modules

2016-02-02 Thread Koert Kuipers
I am seeing make-distribution fail because lib_managed does not exist. What
seems to happen is that the sql/hive module gets built and creates this
directory, but some time later the spark-parent module gets built, which
includes:

[INFO] Building Spark Project Parent POM 1.6.0-SNAPSHOT
[INFO]

[INFO]
[INFO] --- maven-clean-plugin:2.6.1:clean (default-clean) @
spark-parent_2.11 ---
[INFO] Deleting /home/koert/src/spark/target
[INFO] Deleting /home/koert/src/spark/lib_managed (includes = [], excludes
= [])

So after Maven is done, by the time make-distribution tries to access
lib_managed, it's gone.

Now, this is with a very slightly modified Spark 1.6.0 build (one extra  in
the assembly module), so I am guessing I caused it. Does anyone know enough
Maven foo to tell me how to fix this? I think I need to get the module order
fixed so that spark-parent gets built before sql/hive.

thanks! koert


how to calculate -- executor-memory,num-executors,total-executor-cores

2016-02-02 Thread Divya Gehlot
Hi,

I would like to know how to calculate how much --executor-memory we should
allocate, and how many --num-executors and --total-executor-cores we should
give when submitting Spark jobs.
Is there a formula for it?


Thanks,
Divya


Re: how to calculate -- executor-memory,num-executors,total-executor-cores

2016-02-02 Thread Jia Zou
Divya,

According to my recent Spark tuning experience, the optimal executor-memory
size depends not only on your workload characteristics (e.g. the working-set
size at each job stage) and input data size, but also on your total available
memory and the memory requirements of other components such as the driver
(which in turn depend on how your workload interacts with the driver) and the
underlying storage. In my opinion, it is difficult to derive one generic and
easy formula that describes all of these dynamic relationships.


Best Regards,
Jia
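
That said, a commonly cited starting point can be sketched as plain
arithmetic; the cluster numbers and the 5-cores-per-executor figure below are
assumptions (a rule of thumb, not something from this thread), and you would
still need to profile and adjust per workload as described above.

// Rough rule-of-thumb sizing for a hypothetical YARN cluster; all numbers
// are made up for illustration.
object ExecutorSizingSketch {
  def main(args: Array[String]): Unit = {
    val nodes        = 10
    val coresPerNode = 16
    val memPerNodeGb = 64

    val usableCores      = coresPerNode - 1             // leave a core for OS/daemons
    val usableMemGb      = memPerNodeGb - 1             // leave some memory for OS/daemons
    val coresPerExecutor = 5                            // often cited for good HDFS throughput
    val executorsPerNode = usableCores / coresPerExecutor       // 3
    val numExecutors     = nodes * executorsPerNode - 1         // keep one slot for the driver/AM
    val memPerExecutorGb = usableMemGb / executorsPerNode       // 21
    // Leave headroom for off-heap overhead (spark.yarn.executor.memoryOverhead).
    val executorMemoryGb = (memPerExecutorGb * 0.9).toInt       // 18

    println(s"--num-executors $numExecutors --executor-cores $coresPerExecutor " +
            s"--executor-memory ${executorMemoryGb}g")
  }
}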

On Wed, Feb 3, 2016 at 12:13 AM, Divya Gehlot 
wrote:

> Hi,
>
> I would like to know how to calculate how much --executor-memory we should
> allocate, and how many --num-executors and --total-executor-cores we should
> give when submitting Spark jobs.
> Is there a formula for it?
>
>
> Thanks,
> Divya
>


Re: Dynamic sql in Spark 1.5

2016-02-02 Thread Divya Gehlot
Hi,
I have data sets like the following:

Dataset 1
HeaderCol1    HeadCol2     HeadCol3
dataset 1     dataset2     dataset 3
dataset 11    dataset13    dataset 13
dataset 21    dataset22    dataset 23

Dataset 2
HeadColumn1   HeadColumn2         HeadColumn3         HeadColumn4
Tag1          Dataset1
Tag2          Dataset1            Dataset2
Tag3          Dataset1            Dataset2            Dataset3
Tag4          DifferentDataset1
Tag5          DifferentDataset1   DifferentDataset2
Tag6          DifferentDataset1   DifferentDataset2   DifferentDataset3


My requirement is to tag the dataset (i.e., add one more column) based on
Dataset 1.

Can I implement this in Spark?
In an RDBMS we implemented it using dynamic SQL.

I would really appreciate the help.


Thanks,
Divya
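
As one possible direction, here is a minimal sketch of expressing that kind
of tagging with the DataFrame API (see Ali's suggestion quoted below). The
rule set and the local-mode setup are assumptions for illustration; the
point is that the CASE-style logic is built programmatically rather than by
concatenating a SQL string.

import org.apache.spark.sql.{Column, SQLContext}
import org.apache.spark.sql.functions.{lit, when}
import org.apache.spark.{SparkConf, SparkContext}

object DynamicTaggingSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("dynamic-tagging").setMaster("local[2]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Stand-in for Dataset 1, using the column names from the example above.
    val df = Seq(
      ("dataset 1",  "dataset2",  "dataset 3"),
      ("dataset 11", "dataset13", "dataset 13"),
      ("dataset 21", "dataset22", "dataset 23")
    ).toDF("HeaderCol1", "HeadCol2", "HeadCol3")

    // Tagging rules built at runtime (in practice they would be derived from
    // Dataset 2); a rule listed later wins when several rules match.
    val rules: Seq[(String, Column)] = Seq(
      "Tag1" -> (df("HeaderCol1") === "dataset 1"),
      "Tag2" -> (df("HeaderCol1") === "dataset 1" && df("HeadCol2") === "dataset2")
    )

    // Fold the rules into a single CASE/WHEN-style expression instead of
    // building a SQL string.
    val tagExpr = rules.foldLeft(lit("untagged"): Column) {
      case (expr, (tag, condition)) => when(condition, lit(tag)).otherwise(expr)
    }

    df.withColumn("tag", tagExpr).show()
    sc.stop()
  }
}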





On 3 February 2016 at 11:42, Ali Tajeldin EDU  wrote:

> While you can construct the SQL string dynamically in scala/java/python,
> it would be best to use the Dataframe API for creating dynamic SQL
> queries.  See
> http://spark.apache.org/docs/1.5.2/sql-programming-guide.html for details.
>
> On Feb 2, 2016, at 6:49 PM, Divya Gehlot  wrote:
>
> Hi,
> Does Spark support dynamic SQL?
> I would really appreciate it if anyone could share some
> references/examples.
>
>
>
> Thanks,
> Divya
>
>
>


Re: [MLLib] Is the order of the coefficients in a LogisticRegresionModel kept ?

2016-02-02 Thread Yanbo Liang
For your case, it's true.
But it is not always correct for a pipeline model; some transformers in the
pipeline, such as OneHotEncoder, will change the features.
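
For the simple StandardScaler + LogisticRegression case described in the
quoted question below, a small sketch (the helper is illustrative, not an
existing API) that makes the index-to-feature correspondence explicit; as
noted, it would no longer hold if a transformer such as OneHotEncoder
expanded or reordered the feature vector.

import org.apache.spark.ml.PipelineModel
import org.apache.spark.ml.classification.LogisticRegressionModel
import org.apache.spark.ml.feature.StandardScalerModel

object InspectCoefficients {
  // featureNames lists f1..fn in the order used to build the dense vectors;
  // neither StandardScaler nor LogisticRegression reorders that vector.
  def report(fitted: PipelineModel, featureNames: Array[String]): Unit = {
    val scaler = fitted.stages.collectFirst { case m: StandardScalerModel => m }.get
    val lr     = fitted.stages.collectFirst { case m: LogisticRegressionModel => m }.get

    featureNames.zipWithIndex.foreach { case (name, j) =>
      println(f"$name%-15s mean=${scaler.mean(j)}%.4f " +
              f"std=${scaler.std(j)}%.4f coef=${lr.coefficients(j)}%.4f")
    }
  }
}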

2016-02-03 1:21 GMT+08:00 jmvllt :

> Hi everyone,
>
> This may sound like a stupid question, but I need to be sure of this:
>
> Given a dataframe composed of "n" features: f1, f2, ..., fn.
>
> For each row of my dataframe, I create a labeled point:
> val row_i = LabeledPoint(label, Vectors.dense(v1_i, v2_i, ..., vn_i))
> where v1_i, v2_i, ..., vn_i are respectively the values of the features
> f1, f2, ..., fn of the i-th row.
>
> Then, I fit a pipeline composed of a StandardScaler and a
> LogisticRegression model.
> When I get back my LogisticRegressionModel and StandardScalerModel from the
> pipeline, I'm calling the getters:
> LogisticRegressionModel.coefficients, StandardScalerModel.mean and
> StandardScalerModel.std
>
> This gives me 3 vectors of length "n".
>
> My question is the following:
> Am I assured that the element at index "j" of each vector corresponds to
> the feature "j"? Is the *order* of the features kept?
> e.g.: Is StandardScalerModel.mean(j) the mean of the feature "j" of my
> data frame?
>
> Thanks for your time.
> Regards,
> J.
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/MLLib-Is-the-order-of-the-coefficients-in-a-LogisticRegresionModel-kept-tp26137.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>