date:20160624

[no subject]

2016-06-24 Thread Rama Perubotla

Unsubscribe

Re: Running JavaBased Implementation of StreamingKmeans Spark

2016-06-24 Thread Jayant Shekhar

Hi Biplop,

Can you try adding new files to the training/test directories after you
have started your streaming application! Especially the test directory as
you are printing your predictions.

On Fri, Jun 24, 2016 at 2:32 PM, Biplob Biswas 
wrote:

>
> Hi,
>
> I implemented the streamingKmeans example provided in the spark website but
> in Java.
> The full implementation is here,
>
> http://pastebin.com/CJQfWNvk
>
> But i am not getting anything in the output except occasional timestamps
> like one below:
>
> ---
> Time: 1466176935000 ms
> ---
>
> Also, i have 2 directories:
> "D:\spark\streaming example\Data Sets\training"
> "D:\spark\streaming example\Data Sets\test"
>
> and inside these directories i have 1 file each "samplegpsdata_train.txt"
> and "samplegpsdata_test.txt" with training data having 500 datapoints and
> test data with 60 datapoints.
>
> I am very new to the spark systems and any help is highly appreciated.
>
> Thank you so much
> Biplob Biswas
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Running-JavaBased-Implementation-of-StreamingKmeans-Spark-tp27225.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>

Poor performance of using spark sql over gzipped json files

2016-06-24 Thread Shuai Lin

Hi,

We have tried to use spark sql to process some gzipped json-format log
files stored on S3 or HDFS. But the performance is very poor.

For example, here is the code that I run over 20 gzipped files (total size
of them is 4GB compressed and ~40GB when decompressed)

gzfile = 's3n://my-logs-bucket/*.gz' # or 'hdfs://nameservice1/tmp/*.gz'
sc_file = sc.textFile(gzfile)
sc_file.cache()
df = sqlContext.jsonRDD(sc_file)
df.select('*').limit(1).show()

With 6 executors launched, each with 2 cpu cores and 5GB RAM, the
"df.select" operation would always take more than 150 secs to finish,
regardless of whether the files are stored on s3 or HDFS.

BTW we are running spark 1.6.1 on mesos, with fine-grained mode.

Downloading from s3 is fast. In another test within the same environment,
it takes no more than 2 minutes to finish a simple "sc_file.count()" over
500 similar files whose total size is 15GB when compressed, and 400GB when
decompressed.

I thought the bottleneck might be in the json schema auto-inference.
However, I have tried specify the schema explicitly instead of letting
spark infer it, but that makes no notable difference.

Things I plan to try soon:

* Decompress the gz files and save it to HDFS, construct a data frame on
decompressed files, then run sql over it.
* Or save the json files into parquet format on HDFS, and then run sql over
it.

Do you have any suggestions? Thanks!

Regards,
Shuai

Re: How can I use pyspark.ml.evaluation.BinaryClassificationEvaluator with point predictions instead of confidence intervals?

2016-06-24 Thread apu

SOLVED.

The rawPredictionCol input to BinaryClassificationEvaluator is a
vector specifying the prediction confidence for each class. Since we
are talking about binary classification the prediction for class 0 is
simply (1 - y_pred), where y_pred is the prediction for class 1.

So this can be applied to ALS for boolean ratings as follows:

# First, train model and create predictions
from pyspark.ml.recommendation import ALS
model = ALS().fit(trainingdata)
predictions = model.transform(validationdata)

# Vectorize predictions to prep for evaluation
from pyspark.mllib.linalg import Vectors, VectorUDT
predictionvectorizer = udf(lambda x: Vectors.dense(1.0 - x, x),
returnType=VectorUDT())
vectorizedpredictions =
predictions.withColumn("rawPrediction",predictionvectorizer("prediction"))

# Now evaluate predictions
from pyspark.ml.evaluation import BinaryClassificationEvaluator
evaluator = BinaryClassificationEvaluator()
evaluator.evaluate(vectorizedpredictions)

On Fri, Jun 24, 2016 at 10:42 AM, apu  wrote:
> pyspark.ml.evaluation.BinaryClassificationEvaluator expects
> predictions in the form of vectors (apparently designating confidence
> intervals), as described in
> https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.evaluation.BinaryClassificationEvaluator
>
> However, I am trying to evaluate ALS predictions, which are given as
> single point predictions without confidence intervals. Therefore,
> predictions are given as floats rather than vectors.
>
> How can I evaluate these using ml's BinaryClassificationEvaluator?
>
> (Note that this is a different function from mllib's
> BinaryClassificationMetrics.)
>
> Thanks!
>
> Apu

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

Spark 2.0 Continuous Processing

2016-06-24 Thread kmat

Is there a way to checkpoint sink(s) to facilitate rewind processing from a
specific offset. 
For example a continuous query aggregated by month. 
On the 10 month would like to re-compute information between 4th to 8th
months.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-2-0-Continuous-Processing-tp27226.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

Batch details are missing

2016-06-24 Thread C. Josephson

We're trying to resolve some performance issues with spark streaming using
the application UI, but the batch details page doesn't seem to be working.
When I click on a batch in the streaming application UI, I expect to see
something like this: http://i.stack.imgur.com/ApF8z.png

But instead we see this:
[image: Inline image 1]

Any ideas why we aren't getting any job details? We are running pySpark
1.5.0.

Thanks,
-cjoseph

Running JavaBased Implementation of StreamingKmeans Spark

2016-06-24 Thread Biplob Biswas


Hi, 

I implemented the streamingKmeans example provided in the spark website but
in Java. 
The full implementation is here, 

http://pastebin.com/CJQfWNvk

But i am not getting anything in the output except occasional timestamps
like one below: 

--- 
Time: 1466176935000 ms 
--- 

Also, i have 2 directories: 
"D:\spark\streaming example\Data Sets\training" 
"D:\spark\streaming example\Data Sets\test" 

and inside these directories i have 1 file each "samplegpsdata_train.txt"
and "samplegpsdata_test.txt" with training data having 500 datapoints and
test data with 60 datapoints. 

I am very new to the spark systems and any help is highly appreciated. 

Thank you so much 
Biplob Biswas



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Running-JavaBased-Implementation-of-StreamingKmeans-Spark-tp27225.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

Re: NullPointerException when starting StreamingContext

2016-06-24 Thread Sunita Arvind

I was able to resolve the serialization issue. The root cause was, I was
accessing the config values within foreachRDD{}.
The solution was to extract the values from config outside the foreachRDD
scope and send in values to the loop directly. Probably something obvious
as we cannot have nested distribution data sets. Mentioning it here for
benefit of anyone else stumbling upon the same issue.

regards
Sunita

On Wed, Jun 22, 2016 at 8:20 PM, Sunita Arvind 
wrote:

> Hello Experts,
>
> I am getting this error repeatedly:
>
> 16/06/23 03:06:59 ERROR streaming.StreamingContext: Error starting the 
> context, marking it as stopped
> java.lang.NullPointerException
>   at 
> com.typesafe.config.impl.SerializedConfigValue.writeOrigin(SerializedConfigValue.java:202)
>   at 
> com.typesafe.config.impl.ConfigImplUtil.writeOrigin(ConfigImplUtil.java:228)
>   at 
> com.typesafe.config.ConfigException.writeObject(ConfigException.java:58)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:988)
>   at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1495)
>   at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
>   at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
>   at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
>   at 
> java.io.ObjectOutputStream.defaultWriteObject(ObjectOutputStream.java:440)
>   at java.lang.Throwable.writeObject(Throwable.java:985)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:988)
>   at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1495)
>   at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
>   at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
>   at 
> java.io.ObjectOutputStream.writeFatalException(ObjectOutputStream.java:1576)
>   at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:350)
>   at 
> org.apache.spark.streaming.Checkpoint$$anonfun$serialize$1.apply$mcV$sp(Checkpoint.scala:141)
>   at 
> org.apache.spark.streaming.Checkpoint$$anonfun$serialize$1.apply(Checkpoint.scala:141)
>   at 
> org.apache.spark.streaming.Checkpoint$$anonfun$serialize$1.apply(Checkpoint.scala:141)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1251)
>   at 
> org.apache.spark.streaming.Checkpoint$.serialize(Checkpoint.scala:142)
>   at 
> org.apache.spark.streaming.StreamingContext.validate(StreamingContext.scala:554)
>   at 
> org.apache.spark.streaming.StreamingContext.liftedTree1$1(StreamingContext.scala:601)
>   at 
> org.apache.spark.streaming.StreamingContext.start(StreamingContext.scala:600)
>   at 
> com.edgecast.engine.ProcessingEngine$$anonfun$main$1.apply(ProcessingEngine.scala:73)
>   at 
> com.edgecast.engine.ProcessingEngine$$anonfun$main$1.apply(ProcessingEngine.scala:67)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at com.edgecast.engine.ProcessingEngine$.main(ProcessingEngine.scala:67)
>   at com.edgecast.engine.ProcessingEngine.main(ProcessingEngine.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:542)
>
>
> It seems to be a typical issue. All I am doing here is as below:
>
> Object ProcessingEngine{
>
> def initializeSpark(customer:String):StreamingContext={
>   LogHandler.log.info("InitialeSpark")
>   val custConf = ConfigFactory.load(customer + 
> ".conf").getConfig(customer).withFallback(AppConf)
>   implicit val sparkConf: SparkConf = new SparkConf().setAppName(customer)
>   val ssc: StreamingContext = new StreamingContext(sparkConf, 
> Seconds(custConf.getLong("batchDurSec")))
>   ssc.checkpoint(custConf.getString("checkpointDir"))
>   ssc
> }
>
> def createDataStreamFromKafka(customer:String, ssc: 
> StreamingContext):DStream[Array[Byte]]={
>   val

Re: Logging trait in Spark 2.0

2016-06-24 Thread Jonathan Kelly

Ted, how is that thread related to Paolo's question?

On Fri, Jun 24, 2016 at 1:50 PM Ted Yu  wrote:

> See this related thread:
>
>
> http://search-hadoop.com/m/q3RTtEor1vYWbsW=RE+Configuring+Log4J+Spark+1+5+on+EMR+4+1+
>
> On Fri, Jun 24, 2016 at 6:07 AM, Paolo Patierno 
> wrote:
>
>> Hi,
>>
>> developing a Spark Streaming custom receiver I noticed that the Logging
>> trait isn't accessible anymore in Spark 2.0.
>>
>> trait Logging in package internal cannot be accessed in package
>> org.apache.spark.internal
>>
>> For developing a custom receiver what is the preferred way for logging ?
>> Just using log4j dependency as any other Java/Scala library/application ?
>>
>> Thanks,
>> Paolo
>>
>> *Paolo Patierno*
>>
>> *Senior Software Engineer (IoT) @ Red Hat**Microsoft MVP on **Windows
>> Embedded & IoT*
>> *Microsoft Azure Advisor*
>>
>> Twitter : @ppatierno 
>> Linkedin : paolopatierno 
>> Blog : DevExperience 
>>
>
>

Model Quality Tracking

2016-06-24 Thread Benjamin Kim

Has anyone implemented a way to track the performance of a data model? We 
currently have an algorithm to do record linkage and spit out statistics of 
matches, non-matches, and/or partial matches with reason codes of why we didn’t 
match accurately. In this way, we will know if something goes wrong down the 
line. All of this goes into a csv file directories partitioned by datetime with 
a hive table on top. Then, we can do analytical queries and even charting if 
need be. All of this is very manual, but I was wondering if there is a package, 
software, built-in module, etc. that would do this automatically. Since we are 
using CDH, it would be great if these graphs could be integrated into Cloudera 
Manager too.

Any advice is welcome.

Thanks,
Ben


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

Re: DataFrame versus Dataset creation and usage

2016-06-24 Thread Xinh Huynh

Hi Martin,

Since your schema is dynamic, how would you use Datasets? Would you know
ahead of time the row type T in a Dataset[T]?

One option is to start with DataFrames in the beginning of your data
pipeline, figure out the field types, and then switch completely over to
RDDs or Dataset in the next stage of the pipeline.

Also, I'm not sure what the custom Java mappers are doing - could you use
them as UDFs within a DataFrame?

Xinh

On Fri, Jun 24, 2016 at 11:42 AM, Martin Serrano  wrote:

> Indeed.  But I'm dealing with 1.6 for now unfortunately.
>
>
> On 06/24/2016 02:30 PM, Ted Yu wrote:
>
> In Spark 2.0, Dataset and DataFrame are unified.
>
> Would this simplify your use case ?
>
> On Fri, Jun 24, 2016 at 7:27 AM, Martin Serrano 
> wrote:
>
>> Hi,
>>
>> I'm exposing a custom source to the Spark environment.  I have a question
>> about the best way to approach this problem.
>>
>> I created a custom relation for my source and it creates a
>> DataFrame.  My custom source knows the data types which are
>> *dynamic* so this seemed to be the appropriate return type.  This works
>> fine.
>>
>> The next step I want to take is to expose some custom mapping functions
>> (written in Java).  But when I look at the APIs, the map method for
>> DataFrame returns an RDD (not a DataFrame).  (Should I use
>> SqlContext.createDataFrame on the result? -- does this result in additional
>> processing overhead?)  The Dataset type seems to be more of what I'd be
>> looking for, it's map method returns the Dataset type.  So chaining them
>> together is a natural exercise.
>>
>> But to create the Dataset from a DataFrame, it appears that I have to
>> provide the types of each field in the Row in the DataFrame.as[...]
>> method.  I would think that the DataFrame would be able to do this
>> automatically since it has all the types already.
>>
>> This leads me to wonder how I should be approaching this effort.  As all
>> the fields and types are dynamic, I cannot use beans as my type when
>> passing data around.  Any advice would be appreciated.
>>
>> Thanks,
>> Martin
>>
>>
>>
>>
>
>

Re: Logging trait in Spark 2.0

2016-06-24 Thread Ted Yu

See this related thread:

http://search-hadoop.com/m/q3RTtEor1vYWbsW=RE+Configuring+Log4J+Spark+1+5+on+EMR+4+1+

On Fri, Jun 24, 2016 at 6:07 AM, Paolo Patierno  wrote:

> Hi,
>
> developing a Spark Streaming custom receiver I noticed that the Logging
> trait isn't accessible anymore in Spark 2.0.
>
> trait Logging in package internal cannot be accessed in package
> org.apache.spark.internal
>
> For developing a custom receiver what is the preferred way for logging ?
> Just using log4j dependency as any other Java/Scala library/application ?
>
> Thanks,
> Paolo
>
> *Paolo Patierno*
>
> *Senior Software Engineer (IoT) @ Red Hat**Microsoft MVP on **Windows
> Embedded & IoT*
> *Microsoft Azure Advisor*
>
> Twitter : @ppatierno 
> Linkedin : paolopatierno 
> Blog : DevExperience 
>

Unsubscribe

2016-06-24 Thread R. Revert

El jun. 24, 2016 1:55 PM, "Steve Florence"  escribió:

>
>

Unsubscribe

2016-06-24 Thread Steve Florence

Re: DataFrame versus Dataset creation and usage

2016-06-24 Thread Martin Serrano

Indeed.  But I'm dealing with 1.6 for now unfortunately.

On 06/24/2016 02:30 PM, Ted Yu wrote:
In Spark 2.0, Dataset and DataFrame are unified.

Would this simplify your use case ?

On Fri, Jun 24, 2016 at 7:27 AM, Martin Serrano 
> wrote:
Hi,

I'm exposing a custom source to the Spark environment.  I have a question about 
the best way to approach this problem.

I created a custom relation for my source and it creates a DataFrame.  My 
custom source knows the data types which are dynamic so this seemed to be the 
appropriate return type.  This works fine.

The next step I want to take is to expose some custom mapping functions 
(written in Java).  But when I look at the APIs, the map method for DataFrame 
returns an RDD (not a DataFrame).  (Should I use SqlContext.createDataFrame on 
the result? -- does this result in additional processing overhead?)  The 
Dataset type seems to be more of what I'd be looking for, it's map method 
returns the Dataset type.  So chaining them together is a natural exercise.

But to create the Dataset from a DataFrame, it appears that I have to provide 
the types of each field in the Row in the DataFrame.as[...] method.  I would 
think that the DataFrame would be able to do this automatically since it has 
all the types already.

This leads me to wonder how I should be approaching this effort.  As all the 
fields and types are dynamic, I cannot use beans as my type when passing data 
around.  Any advice would be appreciated.

Thanks,
Martin

Re: DataFrame versus Dataset creation and usage

2016-06-24 Thread Ted Yu

In Spark 2.0, Dataset and DataFrame are unified.

Would this simplify your use case ?

On Fri, Jun 24, 2016 at 7:27 AM, Martin Serrano  wrote:

> Hi,
>
> I'm exposing a custom source to the Spark environment.  I have a question
> about the best way to approach this problem.
>
> I created a custom relation for my source and it creates a
> DataFrame.  My custom source knows the data types which are *dynamic*
> so this seemed to be the appropriate return type.  This works fine.
>
> The next step I want to take is to expose some custom mapping functions
> (written in Java).  But when I look at the APIs, the map method for
> DataFrame returns an RDD (not a DataFrame).  (Should I use
> SqlContext.createDataFrame on the result? -- does this result in additional
> processing overhead?)  The Dataset type seems to be more of what I'd be
> looking for, it's map method returns the Dataset type.  So chaining them
> together is a natural exercise.
>
> But to create the Dataset from a DataFrame, it appears that I have to
> provide the types of each field in the Row in the DataFrame.as[...]
> method.  I would think that the DataFrame would be able to do this
> automatically since it has all the types already.
>
> This leads me to wonder how I should be approaching this effort.  As all
> the fields and types are dynamic, I cannot use beans as my type when
> passing data around.  Any advice would be appreciated.
>
> Thanks,
> Martin
>
>
>
>

How can I use pyspark.ml.evaluation.BinaryClassificationEvaluator with point predictions instead of confidence intervals?

2016-06-24 Thread apu

pyspark.ml.evaluation.BinaryClassificationEvaluator expects
predictions in the form of vectors (apparently designating confidence
intervals), as described in
https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.evaluation.BinaryClassificationEvaluator

However, I am trying to evaluate ALS predictions, which are given as
single point predictions without confidence intervals. Therefore,
predictions are given as floats rather than vectors.

How can I evaluate these using ml's BinaryClassificationEvaluator?

(Note that this is a different function from mllib's
BinaryClassificationMetrics.)

Thanks!

Apu

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

Spark connecting to Hive in another EMR cluster

2016-06-24 Thread Dave Maughan

Hi,

We're trying to get a Spark (1.6.1) job running on EMR (4.7.1) that's
connecting to the Hive metastore in another EMR cluster. A simplification
of what we're doing is below

val sparkConf = new SparkConf().setAppName("MyApp")
val sc = new SparkContext(sparkConf)
val sqlContext = new HiveContext(sc)

sqlContext.setConf("hive.metastore.uris", "thrift://other.emr.master:9083")

sqlContext.sql("use db1") //any statement using sql method.

sqlContext.table("db2.table1").show

The above works but if we remove the sql method then the table.show fails
with NoSuchTableException indicating it hasn't connected to the external
metastore.
So it seems something in the sql method code path does something to connect
to the metastore that table doesn't.

Does this sound familiar to anyone? Is there something we can do to avoid
having to call sql just to force it to connect/configure correctly?

Thanks,
Dave

Re: Improving performance of a kafka spark streaming app

2016-06-24 Thread Cody Koeninger

Unless I'm misreading the image you posted, it does show event counts
for the single batch that is still running, with 1.7 billion events in
it.  The recent batches do show 0 events, but I'm guessing that's
because they're actually empty.

When you said you had a kafka topic with 1.7 billion events in it, did
you mean it just statically contains that many events, and no new
events are coming in currently?  If that's the case, you may be better
off just generating RDDs of an appropriate range of offsets, one after
the other, rather than using streaming.

I'm also still not clear if you have tried benchmarking a job that
simply reads from your topic, without inserting into hbase.

On Thu, Jun 23, 2016 at 12:09 AM, Colin Kincaid Williams  wrote:
> Streaming UI tab showing empty events and very different metrics than on 1.5.2
>
> On Thu, Jun 23, 2016 at 5:06 AM, Colin Kincaid Williams  
> wrote:
>> After a bit of effort I moved from a Spark cluster running 1.5.2, to a
>> Yarn cluster running 1.6.1 jars. I'm still setting the maxRPP. The
>> completed batches are no longer showing the number of events processed
>> in the Streaming UI tab . I'm getting around 4k inserts per second in
>> hbase, but I haven't yet tried to remove or reset the mRPP.  I will
>> attach a screenshot of the UI tab. It shows significantly lower
>> figures for processing and delay times, than the previous posted shot.
>> It also shows the batches as empty, however I see the requests hitting
>> hbase.
>>
>> Then it's possible my issues were related to running on the Spark
>> 1.5.2 cluster. Also is the missing event count in the completed
>> batches a bug? Should I file an issue?
>>
>> On Tue, Jun 21, 2016 at 9:04 PM, Colin Kincaid Williams  
>> wrote:
>>> Thanks @Cody, I will try that out. In the interm, I tried to validate
>>> my Hbase cluster by running a random write test and see 30-40K writes
>>> per second. This suggests there is noticeable room for improvement.
>>>
>>> On Tue, Jun 21, 2016 at 8:32 PM, Cody Koeninger  wrote:
 Take HBase out of the equation and just measure what your read
 performance is by doing something like

 createDirectStream(...).foreach(_.println)

 not take() or print()

 On Tue, Jun 21, 2016 at 3:19 PM, Colin Kincaid Williams  
 wrote:
> @Cody I was able to bring my processing time down to a second by
> setting maxRatePerPartition as discussed. My bad that I didn't
> recognize it as the cause of my scheduling delay.
>
> Since then I've tried experimenting with a larger Spark Context
> duration. I've been trying to get some noticeable improvement
> inserting messages from Kafka -> Hbase using the above application.
> I'm currently getting around 3500 inserts / second on a 9 node hbase
> cluster. So far, I haven't been able to get much more throughput. Then
> I'm looking for advice here how I should tune Kafka and Spark for this
> job.
>
> I can create a kafka topic with as many partitions that I want. I can
> set the Duration and maxRatePerPartition. I have 1.7 billion messages
> that I can insert rather quickly into the Kafka queue, and I'd like to
> get them into Hbase as quickly as possible.
>
> I'm looking for advice regarding # Kafka Topic Partitions / Streaming
> Duration / maxRatePerPartition / any other spark settings or code
> changes that I should make to try to get a better consumption rate.
>
> Thanks for all the help so far, this is the first Spark application I
> have written.
>
> On Mon, Jun 20, 2016 at 12:32 PM, Colin Kincaid Williams  
> wrote:
>> I'll try dropping the maxRatePerPartition=400, or maybe even lower.
>> However even at application starts up I have this large scheduling
>> delay. I will report my progress later on.
>>
>> On Mon, Jun 20, 2016 at 2:12 PM, Cody Koeninger  
>> wrote:
>>> If your batch time is 1 second and your average processing time is
>>> 1.16 seconds, you're always going to be falling behind.  That would
>>> explain why you've built up an hour of scheduling delay after eight
>>> hours of running.
>>>
>>> On Sat, Jun 18, 2016 at 4:40 PM, Colin Kincaid Williams 
>>>  wrote:
 Hi Mich again,

 Regarding batch window, etc. I have provided the sources, but I'm not
 currently calling the window function. Did you see the program source?
 It's only 100 lines.

 https://gist.github.com/drocsid/b0efa4ff6ff4a7c3c8bb56767d0b6877

 Then I would expect I'm using defaults, other than what has been shown
 in the configuration.

 For example:

 In the launcher configuration I set --conf
 spark.streaming.kafka.maxRatePerPartition=500 \ and I

DataFrame versus Dataset creation and usage

2016-06-24 Thread Martin Serrano

Hi,

I'm exposing a custom source to the Spark environment.  I have a question about 
the best way to approach this problem.

I created a custom relation for my source and it creates a DataFrame.  My 
custom source knows the data types which are dynamic so this seemed to be the 
appropriate return type.  This works fine.

The next step I want to take is to expose some custom mapping functions 
(written in Java).  But when I look at the APIs, the map method for DataFrame 
returns an RDD (not a DataFrame).  (Should I use SqlContext.createDataFrame on 
the result? -- does this result in additional processing overhead?)  The 
Dataset type seems to be more of what I'd be looking for, it's map method 
returns the Dataset type.  So chaining them together is a natural exercise.

But to create the Dataset from a DataFrame, it appears that I have to provide 
the types of each field in the Row in the DataFrame.as[...] method.  I would 
think that the DataFrame would be able to do this automatically since it has 
all the types already.

This leads me to wonder how I should be approaching this effort.  As all the 
fields and types are dynamic, I cannot use beans as my type when passing data 
around.  Any advice would be appreciated.

Thanks,
Martin

Re: NullPointerException when starting StreamingContext

2016-06-24 Thread Cody Koeninger

That looks like a classpath problem.  You should not have to include
the kafka_2.10 artifact in your pom, spark-streaming-kafka_2.10
already has a transitive dependency on it.  That being said, 0.8.2.1
is the correct version, so that's a little strange.

How are you building and submitting your application?

Finally, if this ends up being a CDH related issue, you may have
better luck on their forum.

On Thu, Jun 23, 2016 at 1:16 PM, Sunita Arvind  wrote:
> Also, just to keep it simple, I am trying to use 1.6.0CDH5.7.0 in the
> pom.xml as the cluster I am trying to run on is CDH5.7.0 with spark 1.6.0.
>
> Here is my pom setting:
>
>
> 1.6.0-cdh5.7.0
> 
> org.apache.spark
> spark-core_2.10
> ${cdh.spark.version}
> compile
> 
> 
> org.apache.spark
> spark-streaming_2.10
> ${cdh.spark.version}
> compile
> 
> 
> org.apache.spark
> spark-sql_2.10
> ${cdh.spark.version}
> compile
> 
> 
> org.apache.spark
> spark-streaming-kafka_2.10
> ${cdh.spark.version}
> compile
> 
> 
> org.apache.kafka
> kafka_2.10
> 0.8.2.1
> compile
> 
>
> But trying to execute the application throws errors like below:
> Exception in thread "main" java.lang.NoClassDefFoundError:
> kafka/cluster/BrokerEndPoint
> at
> org.apache.spark.streaming.kafka.KafkaCluster$$anonfun$2$$anonfun$3$$anonfun$apply$6$$anonfun$apply$7.apply(KafkaCluster.scala:90)
> at scala.Option.map(Option.scala:145)
> at
> org.apache.spark.streaming.kafka.KafkaCluster$$anonfun$2$$anonfun$3$$anonfun$apply$6.apply(KafkaCluster.scala:90)
> at
> org.apache.spark.streaming.kafka.KafkaCluster$$anonfun$2$$anonfun$3$$anonfun$apply$6.apply(KafkaCluster.scala:87)
> at
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
> at
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
> at
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
> at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:34)
> at
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
> at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
> at
> org.apache.spark.streaming.kafka.KafkaCluster$$anonfun$2$$anonfun$3.apply(KafkaCluster.scala:87)
> at
> org.apache.spark.streaming.kafka.KafkaCluster$$anonfun$2$$anonfun$3.apply(KafkaCluster.scala:86)
> at
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
> at
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
> at scala.collection.immutable.Set$Set1.foreach(Set.scala:74)
> at
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
> at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
> at
> org.apache.spark.streaming.kafka.KafkaCluster$$anonfun$2.apply(KafkaCluster.scala:86)
> at
> org.apache.spark.streaming.kafka.KafkaCluster$$anonfun$2.apply(KafkaCluster.scala:85)
> at scala.util.Either$RightProjection.flatMap(Either.scala:523)
> at
> org.apache.spark.streaming.kafka.KafkaCluster.findLeaders(KafkaCluster.scala:85)
> at
> org.apache.spark.streaming.kafka.KafkaCluster.getLeaderOffsets(KafkaCluster.scala:179)
> at
> org.apache.spark.streaming.kafka.KafkaCluster.getLeaderOffsets(KafkaCluster.scala:161)
> at
> org.apache.spark.streaming.kafka.KafkaCluster.getLatestLeaderOffsets(KafkaCluster.scala:150)
> at
> org.apache.spark.streaming.kafka.KafkaUtils$$anonfun$5.apply(KafkaUtils.scala:215)
> at
> org.apache.spark.streaming.kafka.KafkaUtils$$anonfun$5.apply(KafkaUtils.scala:211)
> at scala.util.Either$RightProjection.flatMap(Either.scala:523)
> at
> org.apache.spark.streaming.kafka.KafkaUtils$.getFromOffsets(KafkaUtils.scala:211)
> at
> org.apache.spark.streaming.kafka.KafkaUtils$.createDirectStream(KafkaUtils.scala:484)
> at
> com.edgecast.engine.ConcurrentOps$.createDataStreamFromKafka(ConcurrentOps.scala:68)
> at
> com.edgecast.engine.ConcurrentOps$.startProcessing(ConcurrentOps.scala:32)
> at com.edgecast.engine.ProcessingEngine$.main(ProcessingEngine.scala:33)
> at com.edgecast.engine.ProcessingEngine.main(ProcessingEngine.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at com.intellij.rt.execution.application.AppMain.main(AppMain.java:144)
> Caused by: java.lang.ClassNotFoundException: kafka.cluster.BrokerEndPoint
> at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
> at java.security.AccessController.doPrivileged(Native Method)
>

streaming on yarn

2016-06-24 Thread Alex Dzhagriev

Hello,

Can someone, please, share the opinions on the options available for
running spark streaming jobs on yarn? The first thing comes to my mind is
to use slider. Googling for such experience didn't give me much. From my
experience running the same jobs on mesos, I have two concerns: automatic
scaling (not blocking the resources if they is no data in the stream) and
the ui to manage the running jobs.

Thanks, Alex.

Logging trait in Spark 2.0

2016-06-24 Thread Paolo Patierno

Hi,

developing a Spark Streaming custom receiver I noticed that the Logging trait 
isn't accessible anymore in Spark 2.0.

trait Logging in package internal cannot be accessed in package 
org.apache.spark.internal

For developing a custom receiver what is the preferred way for logging ? Just 
using log4j dependency as any other Java/Scala library/application ?

Thanks,
Paolo

Paolo PatiernoSenior Software Engineer (IoT) @ Red Hat
Microsoft MVP on Windows Embedded & IoTMicrosoft Azure Advisor 
Twitter : @ppatierno
Linkedin : paolopatierno
Blog : DevExperience

Re: Performance issue with spark ml model to make single predictions on server side

2016-06-24 Thread Nick Pentreath

Currently, spark-ml models and pipelines are only usable in Spark. This
means you must use Spark's machinery (and pull in all its dependencies) to
do model serving. Also currently there is no fast "predict" method for a
single Vector instance.

So for now, you are best off going with PMML, or exporting your model in
your own custom format, and re-loading it into your own custom format for
serving. You can also take a look at PredictionIO (https://prediction.io/)
for another serving option, or TensorFlow serving (
https://tensorflow.github.io/serving/).

On Thu, 23 Jun 2016 at 13:40 philippe v  wrote:

> Hello,
>
> I trained a linear regression model with spark-ml. I serialized the model
> pipeline with classical java serialization. Then I loaded it in a
> webservice
> to compute predictions.
>
> For each request recieved by the webservice I create a 1 row dataframe to
> compute that prediction.
>
> Probleme is that it take too much time
>
> Is there some good practices to do that kind of stuff ?
>
> I could export all model's coeffs with PMML and make computations in pure
> java but I keep it in last resort.
>
> Does any one have some hints to increase performances ?
>
> Philippe
>
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Performance-issue-with-spark-ml-model-to-make-single-predictions-on-server-side-tp27217.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>

Re: Cost of converting RDD's to dataframe and back

2016-06-24 Thread Jörn Franke



Yes yes true. I just wonder if somebody took measurements for all different 
types of problems in the Big Data area and created some scientific analysis how 
much time is wasted on serialization deserialization to support the figure of 
80% ;)



> On 24 Jun 2016, at 10:35, Jacek Laskowski  wrote:
> 
> Hi Jorn,
> 
> You can measure the time for ser/deser yourself using web UI or 
> SparkListeners.
> 
> Pozdrawiam,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark http://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
> 
> 
>> On Fri, Jun 24, 2016 at 10:14 AM, Jörn Franke  wrote:
>> I would push the Spark people to provide equivalent functionality . In the 
>> end it is a deserialization/serialization process which should not be done 
>> back and forth because it is one of the more costly aspects during 
>> processing. It needs to convert Java objects to a binary representation. It 
>> is ok to do it once, because afterwards the access in binary form is much 
>> more efficient, but this will be completely irrelevant if you convert back 
>> and forth all the time.
>> 
>> I have heard somewhere the figure that serialization/deserialization takes 
>> 80% of the time in the big data world, but i would be happy to see this 
>> figure be confirmed empirically for different scenarios. Unfortunately I do 
>> not have a source for this figure so do not take it as granted.
>> 
>>> On 24 Jun 2016, at 08:00, pan  wrote:
>>> 
>>> Hello,
>>>  I am trying to understand the cost of converting an RDD to Dataframe and
>>> back. Would a conversion back and forth very frequently cost performance.
>>> 
>>> I do observe that some operations like join are implemented very differently
>>> for RDD (pair) and Dataframe so trying to figure out the cose of converting
>>> one to another
>>> 
>>> Regards,
>>> Pranav
>>> 
>>> 
>>> 
>>> --
>>> View this message in context: 
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Cost-of-converting-RDD-s-to-dataframe-and-back-tp27222.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>> 
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>> 
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>> 

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

Spark Xml schema help

2016-06-24 Thread Nandan Thakur

Hi All,

I have been using spark-xml  in
one of my project to process some xml files. but when I don't provide
custom schema, this jar automatically generate following schema:-
root
 |-- UserData: struct (nullable = true)
 ||-- UserValue: array (nullable = true)
 |||-- element: string (containsNull = true)
 ||-- type: string (nullable = true)
 |-- id: string (nullable = true)
 |-- instanceRefs: string (nullable = true)
 |-- name: string (nullable = true)


But this is not the correct schema, what i want is something like this:-

root
 |-- UserData: struct (nullable = true)
 ||-- UserValue: array (nullable = true)
 |||-- title: string (nullable = true)
 |||-- value: string (nullable = true))
 ||-- type: string (nullable = true)
 |-- id: string (nullable = true)
 |-- instanceRefs: string (nullable = true)
 |-- name: string (nullable = true)

I have been trying to provide custom schema but it is always saying that
field 'title' is not present in the schema.
I tried to change datatype and structure but its not working. am i missing
something or there is some bug with spark-xml for nested structure???

Thanks and Regards,
Nandan

Re: problem running spark with yarn-client not using spark-submit

2016-06-24 Thread Mich Talebzadeh

Hi,
Trying to run spark with yarn-client not using spark-submit here

what are you using to submit the job? spark-shell, spark-sql  or anything
else


Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 24 June 2016 at 12:01,  wrote:

>
> Hello guys,
>
> Trying to run spark with yarn-client not using spark-submit here but the
> jobs kept failed while AM launching executor.
> The error collected by yarn like below.
> Looks like some environment setting is missing?
> Could someone help me out with this.
>
> Thanks  in advance!
> HY Chung
>
> Java HotSpot(TM) 64-Bit Server VM warning: ignoring option
> MaxPermSize=256m; support was removed in 8.0
> Exception in thread "main" java.lang.NoClassDefFoundError:
> org/apache/spark/Logging
>  at java.lang.ClassLoader.defineClass1(Native Method)
>  at java.lang.ClassLoader.defineClass(ClassLoader.java:760)
>  at java.security.SecureClassLoader.defineClass
> (SecureClassLoader.java:142)
>  at java.net.URLClassLoader.defineClass
> (URLClassLoader.java:455)
>  at
> java.net.URLClassLoader.access$100(URLClassLoader.java:73)
>  at java.net.URLClassLoader$1.run(URLClassLoader.java:367)
>  at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
>  at java.security.AccessController.doPrivileged(Native
> Method)
>  at
> java.net.URLClassLoader.findClass(URLClassLoader.java:360)
>  at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>  at sun.misc.Launcher$AppClassLoader.loadClass
> (Launcher.java:308)
>  at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>  at org.apache.spark.deploy.yarn.ExecutorLauncher$.main
> (ApplicationMaster.scala:674)
>  at org.apache.spark.deploy.yarn.ExecutorLauncher.main
> (ApplicationMaster.scala)
> Caused by: java.lang.ClassNotFoundException: org.apache.spark.Logging
>  at java.net.URLClassLoader$1.run(URLClassLoader.java:372)
>  at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
>  at java.security.AccessController.doPrivileged(Native
> Method)
>  at
> java.net.URLClassLoader.findClass(URLClassLoader.java:360)
>  at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>  at sun.misc.Launcher$AppClassLoader.loadClass
> (Launcher.java:308)
>  at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>  ... 14 more
>
>
> Best regards,
>
> S.Y. Chung 鍾學毅
> F14MITD
> Taiwan Semiconductor Manufacturing Company, Ltd.
> Tel: 06-5056688 Ext: 734-6325
>
>  ---
>  TSMC PROPERTY
>  This email communication (and any attachments) is proprietary information
>  for the sole use of its
>  intended recipient. Any unauthorized review, use or distribution by anyone
>  other than the intended
>  recipient is strictly prohibited.  If you are not the intended recipient,
>  please notify the sender by
>  replying to this email, and then delete this email and any copies of it
>  immediately. Thank you.
>
>  ---
>
>
>

How to convert a Random Forest model built in R to a similar model in Spark

2016-06-24 Thread Neha Mehta

Hi Sun,

I am trying to build a model in Spark. Here are the parameters which were
used in R for creating the model, I am unable to figure out how to specify
a similar input to the random forest regressor in Spark so that I get a
similar model in Spark.

https

://cran.r-project.org/web/packages/

randomForest

/randomForest.pdf

mytry=3

ntree=500

importance=TRUE

maxnodes = NULL
On May 31, 2016 7:03 AM, "Sun Rui"  wrote:

I mean train random forest model (not using R) and use it for prediction
together using Spark ML.

On May 30, 2016, at 20:15, Neha Mehta  wrote:

Thanks Sujeet.. will try it out.

Hi Sun,

Can you please tell me what did you mean by "Maybe you can try using the
existing random forest model" ? did you mean creating the model again using
Spark MLLIB?

Thanks,
Neha

> From: sujeet jog 
> Date: Mon, May 30, 2016 at 4:52 PM
> Subject: Re: Can we use existing R model in Spark
> To: Sun Rui 
> Cc: Neha Mehta , user 
>
>
> Try to invoke a R script from Spark using rdd pipe method , get the work
> done & and receive the model back in RDD.
>
>
> for ex :-
> .   rdd.pipe("")
>
>
> On Mon, May 30, 2016 at 3:57 PM, Sun Rui  wrote:
>
>> Unfortunately no. Spark does not support loading external modes (for
>> examples, PMML) for now.
>> Maybe you can try using the existing random forest model in Spark.
>>
>> On May 30, 2016, at 18:21, Neha Mehta  wrote:
>>
>> Hi,
>>
>> I have an existing random forest model created using R. I want to use
>> that to predict values on Spark. Is it possible to do the same? If yes,
>> then how?
>>
>> Thanks & Regards,
>> Neha
>>
>>
>>
>
>

Re: Error Invoking Spark on Yarn on using Spark Submit

2016-06-24 Thread Mich Talebzadeh

Hi Punneet,

File does not exist:
hdfs://localhost:8020/user/opc/.sparkStaging/application_1466711725829_0033/pipeline-lib-0.1.0-SNAPSHOT.jar

indicates a YARN issue. It is trying to get that file from HDFS and copy it
across to /tmp directory.


   1. Check that the class is actually created at compile time vis sbn mvt
   or if the jar file exists
   2. check that the working directory in YARN have the correct permissions
   3.

In yarn-site.xml check the following parameter is set


yarn.nodemanager.local-dirs
/tmp
  


HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 24 June 2016 at 10:46, Jeff Zhang  wrote:

> You might have multiple java servlet jars on your classpath.
>
> On Fri, Jun 24, 2016 at 3:31 PM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> can you please check the yarn log files to see what they say (both the
>> nodemamager and resourcemanager)
>>
>> HTH
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 24 June 2016 at 08:14, puneet kumar 
>> wrote:
>>
>>>
>>>
>>> I am getting below error thrown when I submit Spark Job using Spark
>>> Submit on Yarn. Need a quick help on what's going wrong here.
>>>
>>> 16/06/24 01:09:25 WARN AbstractLifeCycle: FAILED 
>>> org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter-791eb5d5: 
>>> java.lang.IllegalStateException: class 
>>> org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter is not a 
>>> javax.servlet.Filter
>>> java.lang.IllegalStateException: class 
>>> org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter is not a 
>>> javax.servlet.Filter
>>> at 
>>> org.spark-project.jetty.servlet.FilterHolder.doStart(FilterHolder.java:97)
>>> at 
>>> org.spark-project.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:64)
>>> at 
>>> org.spark-project.jetty.servlet.ServletHandler.initialize(ServletHandler.java:768)
>>> at 
>>> org.spark-project.jetty.servlet.ServletContextHandler.startContext(ServletContextHandler.java:265)
>>> at 
>>> org.spark-project.jetty.server.handler.ContextHandler.doStart(ContextHandler.java:717)
>>> at 
>>> org.spark-project.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:64)
>>> at 
>>> org.spark-project.jetty.server.handler.HandlerWrapper.doStart(HandlerWrapper.java:95)
>>> at 
>>> org.spark-project.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:64)
>>> at 
>>> org.spark-project.jetty.server.handler.HandlerCollection.doStart(HandlerCollection.java:229)
>>> at 
>>> org.spark-project.jetty.server.handler.ContextHandlerCollection.doStart(ContextHandlerCollection.java:172)
>>> at 
>>> org.spark-project.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:64)
>>> at 
>>> org.spark-project.jetty.server.handler.HandlerWrapper.doStart(HandlerWrapper.java:95)
>>> at org.spark-project.jetty.server.Server.doStart(Server.java:282)
>>> at 
>>> org.spark-project.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:64)
>>> at 
>>> org.apache.spark.ui.JettyUtils$.org$apache$spark$ui$JettyUtils$$connect$1(JettyUtils.scala:252)
>>> at org.apache.spark.ui.JettyUtils$$anonfun$5.apply(JettyUtils.scala:262)
>>> at org.apache.spark.ui.JettyUtils$$anonfun$5.apply(JettyUtils.scala:262)
>>> at 
>>> org.apache.spark.util.Utils$$anonfun$startServiceOnPort$1.apply$mcVI$sp(Utils.scala:1988)
>>> at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
>>> at org.apache.spark.util.Utils$.startServiceOnPort(Utils.scala:1979)
>>> at 
>>> org.apache.spark.ui.JettyUtils$.startJettyServer(JettyUtils.scala:262)
>>> at org.apache.spark.ui.WebUI.bind(WebUI.scala:137)
>>> at 
>>> org.apache.spark.SparkContext$$anonfun$13.apply(SparkContext.scala:481)
>>> at 
>>> org.apache.spark.SparkContext$$anonfun$13.apply(SparkContext.scala:481)
>>> at scala.Option.foreach(Option.scala:236)
>>> at org.apache.spark.SparkContext.(SparkContext.scala:481)
>>> at 
>>> org.apache.spark.api.java.JavaSparkContext.(JavaSparkContext.scala:59)
>>>
>>>
>>>
>>
>
>
> --
> Best Regards
>
> Jeff Zhang
>

problem running spark with yarn-client not using spark-submit

2016-06-24 Thread sychungd


Hello guys,

Trying to run spark with yarn-client not using spark-submit here but the
jobs kept failed while AM launching executor.
The error collected by yarn like below.
Looks like some environment setting is missing?
Could someone help me out with this.

Thanks  in advance!
HY Chung

Java HotSpot(TM) 64-Bit Server VM warning: ignoring option
MaxPermSize=256m; support was removed in 8.0
Exception in thread "main" java.lang.NoClassDefFoundError:
org/apache/spark/Logging
 at java.lang.ClassLoader.defineClass1(Native Method)
 at java.lang.ClassLoader.defineClass(ClassLoader.java:760)
 at java.security.SecureClassLoader.defineClass
(SecureClassLoader.java:142)
 at java.net.URLClassLoader.defineClass
(URLClassLoader.java:455)
 at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
 at java.net.URLClassLoader$1.run(URLClassLoader.java:367)
 at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
 at java.security.AccessController.doPrivileged(Native Method)
 at java.net.URLClassLoader.findClass(URLClassLoader.java:360)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
 at sun.misc.Launcher$AppClassLoader.loadClass
(Launcher.java:308)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
 at org.apache.spark.deploy.yarn.ExecutorLauncher$.main
(ApplicationMaster.scala:674)
 at org.apache.spark.deploy.yarn.ExecutorLauncher.main
(ApplicationMaster.scala)
Caused by: java.lang.ClassNotFoundException: org.apache.spark.Logging
 at java.net.URLClassLoader$1.run(URLClassLoader.java:372)
 at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
 at java.security.AccessController.doPrivileged(Native Method)
 at java.net.URLClassLoader.findClass(URLClassLoader.java:360)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
 at sun.misc.Launcher$AppClassLoader.loadClass
(Launcher.java:308)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
 ... 14 more


Best regards,

S.Y. Chung 鍾學毅
F14MITD
Taiwan Semiconductor Manufacturing Company, Ltd.
Tel: 06-5056688 Ext: 734-6325
 --- 
 TSMC PROPERTY   
 This email communication (and any attachments) is proprietary information   
 for the sole use of its 
 intended recipient. Any unauthorized review, use or distribution by anyone  
 other than the intended 
 recipient is strictly prohibited.  If you are not the intended recipient,   
 please notify the sender by 
 replying to this email, and then delete this email and any copies of it 
 immediately. Thank you. 
 ---

Re: Error Invoking Spark on Yarn on using Spark Submit

2016-06-24 Thread Jeff Zhang

You might have multiple java servlet jars on your classpath.

On Fri, Jun 24, 2016 at 3:31 PM, Mich Talebzadeh 
wrote:

> can you please check the yarn log files to see what they say (both the
> nodemamager and resourcemanager)
>
> HTH
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 24 June 2016 at 08:14, puneet kumar  wrote:
>
>>
>>
>> I am getting below error thrown when I submit Spark Job using Spark
>> Submit on Yarn. Need a quick help on what's going wrong here.
>>
>> 16/06/24 01:09:25 WARN AbstractLifeCycle: FAILED 
>> org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter-791eb5d5: 
>> java.lang.IllegalStateException: class 
>> org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter is not a 
>> javax.servlet.Filter
>> java.lang.IllegalStateException: class 
>> org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter is not a 
>> javax.servlet.Filter
>>  at 
>> org.spark-project.jetty.servlet.FilterHolder.doStart(FilterHolder.java:97)
>>  at 
>> org.spark-project.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:64)
>>  at 
>> org.spark-project.jetty.servlet.ServletHandler.initialize(ServletHandler.java:768)
>>  at 
>> org.spark-project.jetty.servlet.ServletContextHandler.startContext(ServletContextHandler.java:265)
>>  at 
>> org.spark-project.jetty.server.handler.ContextHandler.doStart(ContextHandler.java:717)
>>  at 
>> org.spark-project.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:64)
>>  at 
>> org.spark-project.jetty.server.handler.HandlerWrapper.doStart(HandlerWrapper.java:95)
>>  at 
>> org.spark-project.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:64)
>>  at 
>> org.spark-project.jetty.server.handler.HandlerCollection.doStart(HandlerCollection.java:229)
>>  at 
>> org.spark-project.jetty.server.handler.ContextHandlerCollection.doStart(ContextHandlerCollection.java:172)
>>  at 
>> org.spark-project.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:64)
>>  at 
>> org.spark-project.jetty.server.handler.HandlerWrapper.doStart(HandlerWrapper.java:95)
>>  at org.spark-project.jetty.server.Server.doStart(Server.java:282)
>>  at 
>> org.spark-project.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:64)
>>  at 
>> org.apache.spark.ui.JettyUtils$.org$apache$spark$ui$JettyUtils$$connect$1(JettyUtils.scala:252)
>>  at org.apache.spark.ui.JettyUtils$$anonfun$5.apply(JettyUtils.scala:262)
>>  at org.apache.spark.ui.JettyUtils$$anonfun$5.apply(JettyUtils.scala:262)
>>  at 
>> org.apache.spark.util.Utils$$anonfun$startServiceOnPort$1.apply$mcVI$sp(Utils.scala:1988)
>>  at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
>>  at org.apache.spark.util.Utils$.startServiceOnPort(Utils.scala:1979)
>>  at 
>> org.apache.spark.ui.JettyUtils$.startJettyServer(JettyUtils.scala:262)
>>  at org.apache.spark.ui.WebUI.bind(WebUI.scala:137)
>>  at 
>> org.apache.spark.SparkContext$$anonfun$13.apply(SparkContext.scala:481)
>>  at 
>> org.apache.spark.SparkContext$$anonfun$13.apply(SparkContext.scala:481)
>>  at scala.Option.foreach(Option.scala:236)
>>  at org.apache.spark.SparkContext.(SparkContext.scala:481)
>>  at 
>> org.apache.spark.api.java.JavaSparkContext.(JavaSparkContext.scala:59)
>>
>>
>>
>


-- 
Best Regards

Jeff Zhang

Re: Cost of converting RDD's to dataframe and back

2016-06-24 Thread Pranav Nakhe

Hello,
   The question came from the point that dataframe uses tungsten
improvements with usage of catalyst optimizer. So there would be some
additional work spark does to convert an RDD to dataframe to use the
optimizations/improvements available to dataframes.

Regards,
Pranav

On Fri, Jun 24, 2016 at 2:05 PM, Jacek Laskowski  wrote:

> Hi Jorn,
>
> You can measure the time for ser/deser yourself using web UI or
> SparkListeners.
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark http://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>
>
> On Fri, Jun 24, 2016 at 10:14 AM, Jörn Franke 
> wrote:
> > I would push the Spark people to provide equivalent functionality . In
> the end it is a deserialization/serialization process which should not be
> done back and forth because it is one of the more costly aspects during
> processing. It needs to convert Java objects to a binary representation. It
> is ok to do it once, because afterwards the access in binary form is much
> more efficient, but this will be completely irrelevant if you convert back
> and forth all the time.
> >
> > I have heard somewhere the figure that serialization/deserialization
> takes 80% of the time in the big data world, but i would be happy to see
> this figure be confirmed empirically for different scenarios. Unfortunately
> I do not have a source for this figure so do not take it as granted.
> >
> >> On 24 Jun 2016, at 08:00, pan  wrote:
> >>
> >> Hello,
> >>   I am trying to understand the cost of converting an RDD to Dataframe
> and
> >> back. Would a conversion back and forth very frequently cost
> performance.
> >>
> >> I do observe that some operations like join are implemented very
> differently
> >> for RDD (pair) and Dataframe so trying to figure out the cose of
> converting
> >> one to another
> >>
> >> Regards,
> >> Pranav
> >>
> >>
> >>
> >> --
> >> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Cost-of-converting-RDD-s-to-dataframe-and-back-tp27222.html
> >> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> >>
> >> -
> >> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> >> For additional commands, e-mail: user-h...@spark.apache.org
> >>
> >
> > -
> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> > For additional commands, e-mail: user-h...@spark.apache.org
> >
>

Re: Spark SQL NoSuchMethodException...DriverWrapper.()

2016-06-24 Thread Jacek Laskowski

Hi Mirko,

What exactly was the setting? I'd like to reproduce it. Can you file
an issue in JIRA to fix that?

Pozdrawiam,
Jacek Laskowski

https://medium.com/@jaceklaskowski/
Mastering Apache Spark http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski


On Fri, Jun 24, 2016 at 10:54 AM, Mirko  wrote:
> Hi,
>
> Many thanks for the suggestions.
> I discovered that the problem was related on a missing driver definition in
> the jdbc options map.
> The error wasn’t really helpful to understand that!
>
> Cheers,
> Mirko
>
> On 22 Jun 2016, at 18:11, markcitizen [via Apache Spark User List] <[hidden
> email]> wrote:
>
> Hello,
> I can't help you with your particular problem but usually errors like the
> one you're seeing are caused by class version incompatibility.
> I've recently spent a lot of time researching a problem similar to yours
> with Spark 1.6.1 (Scala).
> For us the collision was related to class version in
> org.jboss.netty.handler.ssl package.
>
> I think there are three ways to solve this:
> 1) Check jar versions for jars deployed as part of Spark runtime and use the
> same versions in your code
> 2) Update your Spark runtime libs (if you can) with versions of jars that
> work (that can be tricky)
> 3) The solution we used was to add shading configuration (in Build.scala)
> for packages that were colliding:
>
> val shadingRules = Seq(ShadeRule.rename("org.jboss.netty.handler.ssl.**" ->
> "shadeit.@1").inAll)
>
> assemblyShadeRules in assembly := shadingRules
>
> That last option worked best because it allowed us to control version
> collision for multiple packages/classes.
> I don't know if you're using Scala or Java but maybe this gives you some
> ideas on how to proceed.
> Fixing class version collision can be a messy ordeal, you'll need to move
> jars/versions around until it works. It looks like Maven also has a shading
> plugin so if you're using Java maybe you can try that:
> https://maven.apache.org/plugins/maven-shade-plugin/
>
> Best,
>
> M
>
>
> 
> If you reply to this email, your message will be added to the discussion
> below:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-NoSuchMethodException-DriverWrapper-init-tp27171p27211.html
> To unsubscribe from Spark SQL
> NoSuchMethodException...DriverWrapper.(), click here.
> NAML
>
>
>
> 
> View this message in context: Re: Spark SQL
> NoSuchMethodException...DriverWrapper.()
>
> Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

How to write the DataFrame results back to HDFS with other then \n as record separator

2016-06-24 Thread Radha krishna

Hi,
i have some files in the hdfs with FS as field separator and RS as record
separator, i am able to read the files and able to process successfully.
how can i write the spark DataFrame result into the HDFS file with same
delimeters (FS as field separator and RS as record separator instead of \n)
using java
Can any one suggest..

Note: i need to use other than \n bez my data contains \n as part of the
column value.



Thanks & Regards
   Radha krishna

Re: Spark SQL NoSuchMethodException...DriverWrapper.()

2016-06-24 Thread Mirko

Hi,

Many thanks for the suggestions.
I discovered that the problem was related on a missing driver definition in the 
jdbc options map.
The error wasn’t really helpful to understand that!

Cheers,
Mirko

On 22 Jun 2016, at 18:11, markcitizen [via Apache Spark User List] 
>
 wrote:

Hello,
I can't help you with your particular problem but usually errors like the one 
you're seeing are caused by class version incompatibility.
I've recently spent a lot of time researching a problem similar to yours with 
Spark 1.6.1 (Scala).
For us the collision was related to class version in 
org.jboss.netty.handler.ssl package.

I think there are three ways to solve this:
1) Check jar versions for jars deployed as part of Spark runtime and use the 
same versions in your code
2) Update your Spark runtime libs (if you can) with versions of jars that work 
(that can be tricky)
3) The solution we used was to add shading configuration (in Build.scala) for 
packages that were colliding:

val shadingRules = Seq(ShadeRule.rename("org.jboss.netty.handler.ssl.**" -> 
"shadeit.@1").inAll)

assemblyShadeRules in assembly := shadingRules

That last option worked best because it allowed us to control version collision 
for multiple packages/classes.
I don't know if you're using Scala or Java but maybe this gives you some ideas 
on how to proceed.
Fixing class version collision can be a messy ordeal, you'll need to move 
jars/versions around until it works. It looks like Maven also has a shading 
plugin so if you're using Java maybe you can try that:
https://maven.apache.org/plugins/maven-shade-plugin/

Best,

M



If you reply to this email, your message will be added to the discussion below:
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-NoSuchMethodException-DriverWrapper-init-tp27171p27211.html
To unsubscribe from Spark SQL NoSuchMethodException...DriverWrapper.(), 
click 
here.
NAML





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-NoSuchMethodException-DriverWrapper-init-tp27171p27223.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Cost of converting RDD's to dataframe and back

2016-06-24 Thread Jacek Laskowski

Hi Jorn,

You can measure the time for ser/deser yourself using web UI or SparkListeners.

Pozdrawiam,
Jacek Laskowski

https://medium.com/@jaceklaskowski/
Mastering Apache Spark http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski


On Fri, Jun 24, 2016 at 10:14 AM, Jörn Franke  wrote:
> I would push the Spark people to provide equivalent functionality . In the 
> end it is a deserialization/serialization process which should not be done 
> back and forth because it is one of the more costly aspects during 
> processing. It needs to convert Java objects to a binary representation. It 
> is ok to do it once, because afterwards the access in binary form is much 
> more efficient, but this will be completely irrelevant if you convert back 
> and forth all the time.
>
> I have heard somewhere the figure that serialization/deserialization takes 
> 80% of the time in the big data world, but i would be happy to see this 
> figure be confirmed empirically for different scenarios. Unfortunately I do 
> not have a source for this figure so do not take it as granted.
>
>> On 24 Jun 2016, at 08:00, pan  wrote:
>>
>> Hello,
>>   I am trying to understand the cost of converting an RDD to Dataframe and
>> back. Would a conversion back and forth very frequently cost performance.
>>
>> I do observe that some operations like join are implemented very differently
>> for RDD (pair) and Dataframe so trying to figure out the cose of converting
>> one to another
>>
>> Regards,
>> Pranav
>>
>>
>>
>> --
>> View this message in context: 
>> http://apache-spark-user-list.1001560.n3.nabble.com/Cost-of-converting-RDD-s-to-dataframe-and-back-tp27222.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

Re: Cost of converting RDD's to dataframe and back

2016-06-24 Thread Jörn Franke

I would push the Spark people to provide equivalent functionality . In the end 
it is a deserialization/serialization process which should not be done back and 
forth because it is one of the more costly aspects during processing. It needs 
to convert Java objects to a binary representation. It is ok to do it once, 
because afterwards the access in binary form is much more efficient, but this 
will be completely irrelevant if you convert back and forth all the time.

I have heard somewhere the figure that serialization/deserialization takes 80% 
of the time in the big data world, but i would be happy to see this figure be 
confirmed empirically for different scenarios. Unfortunately I do not have a 
source for this figure so do not take it as granted.

> On 24 Jun 2016, at 08:00, pan  wrote:
> 
> Hello,
>   I am trying to understand the cost of converting an RDD to Dataframe and
> back. Would a conversion back and forth very frequently cost performance.
> 
> I do observe that some operations like join are implemented very differently
> for RDD (pair) and Dataframe so trying to figure out the cose of converting
> one to another
> 
> Regards,
> Pranav
> 
> 
> 
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Cost-of-converting-RDD-s-to-dataframe-and-back-tp27222.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
> 

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

Re: categoricalFeaturesInfo

2016-06-24 Thread pseudo oduesp

i want add informations
when i created this dict i fllow this step :
 1-  i create list of all my variable :
   liste double liste int liste categoriel variable
all categoriel variable it s int  typed
2   i create al = listdouble+listint+listcateg :
 command
list(itertools.chain(fdouble,fint,f_index))
like that i keep order of variable  in this order i have all f_index  from
517: to 824
but when i create lable point i lose this order  and i lose type int .



2016-06-24 9:40 GMT+02:00 pseudo oduesp :

> Hi,
> how i can keep type of my variable like int
> because i  get this error when i call random forest algorithm with
>
> model = RandomForest.trainClassifier(rdf,
> numClasses=2,
> categoricalFeaturesInfo=d,
>  numTrees=3,
>  featureSubsetStrategy="auto",
>  impurity='gini',
>  maxDepth=4,
>  maxBins=42)
>
> where d it s dict containing  map of my categoriel feautures (i have
> aleady stirng indexed and transform to int )
>
>
> n error occurred while calling o20271.trainRandomForestModel.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task
> 29 in stage 1898.0 failed 4 times, most recent failure: Lost task 29.3 in
> stage 1898.0 (TID 748788, prbigdata1s013.bigplay.bigdata.intraxa):
> java.lang.IllegalArgumentException: DecisionTree given invalid data:
> Feature 517 is categorical with values in {0,...,16, but a data point gives
> it value 48940.0.
>   Bad data point:
>

categoricalFeaturesInfo

2016-06-24 Thread pseudo oduesp

Hi,
how i can keep type of my variable like int
because i  get this error when i call random forest algorithm with

model = RandomForest.trainClassifier(rdf,
numClasses=2,
categoricalFeaturesInfo=d,
 numTrees=3,
 featureSubsetStrategy="auto",
 impurity='gini',
 maxDepth=4,
 maxBins=42)

where d it s dict containing  map of my categoriel feautures (i have aleady
stirng indexed and transform to int )


n error occurred while calling o20271.trainRandomForestModel.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task
29 in stage 1898.0 failed 4 times, most recent failure: Lost task 29.3 in
stage 1898.0 (TID 748788, prbigdata1s013.bigplay.bigdata.intraxa):
java.lang.IllegalArgumentException: DecisionTree given invalid data:
Feature 517 is categorical with values in {0,...,16, but a data point gives
it value 48940.0.
  Bad data point:

Re: Error Invoking Spark on Yarn on using Spark Submit

2016-06-24 Thread Mich Talebzadeh

can you please check the yarn log files to see what they say (both the
nodemamager and resourcemanager)

HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 24 June 2016 at 08:14, puneet kumar  wrote:

>
>
> I am getting below error thrown when I submit Spark Job using Spark Submit
> on Yarn. Need a quick help on what's going wrong here.
>
> 16/06/24 01:09:25 WARN AbstractLifeCycle: FAILED 
> org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter-791eb5d5: 
> java.lang.IllegalStateException: class 
> org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter is not a 
> javax.servlet.Filter
> java.lang.IllegalStateException: class 
> org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter is not a 
> javax.servlet.Filter
>   at 
> org.spark-project.jetty.servlet.FilterHolder.doStart(FilterHolder.java:97)
>   at 
> org.spark-project.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:64)
>   at 
> org.spark-project.jetty.servlet.ServletHandler.initialize(ServletHandler.java:768)
>   at 
> org.spark-project.jetty.servlet.ServletContextHandler.startContext(ServletContextHandler.java:265)
>   at 
> org.spark-project.jetty.server.handler.ContextHandler.doStart(ContextHandler.java:717)
>   at 
> org.spark-project.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:64)
>   at 
> org.spark-project.jetty.server.handler.HandlerWrapper.doStart(HandlerWrapper.java:95)
>   at 
> org.spark-project.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:64)
>   at 
> org.spark-project.jetty.server.handler.HandlerCollection.doStart(HandlerCollection.java:229)
>   at 
> org.spark-project.jetty.server.handler.ContextHandlerCollection.doStart(ContextHandlerCollection.java:172)
>   at 
> org.spark-project.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:64)
>   at 
> org.spark-project.jetty.server.handler.HandlerWrapper.doStart(HandlerWrapper.java:95)
>   at org.spark-project.jetty.server.Server.doStart(Server.java:282)
>   at 
> org.spark-project.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:64)
>   at 
> org.apache.spark.ui.JettyUtils$.org$apache$spark$ui$JettyUtils$$connect$1(JettyUtils.scala:252)
>   at org.apache.spark.ui.JettyUtils$$anonfun$5.apply(JettyUtils.scala:262)
>   at org.apache.spark.ui.JettyUtils$$anonfun$5.apply(JettyUtils.scala:262)
>   at 
> org.apache.spark.util.Utils$$anonfun$startServiceOnPort$1.apply$mcVI$sp(Utils.scala:1988)
>   at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
>   at org.apache.spark.util.Utils$.startServiceOnPort(Utils.scala:1979)
>   at 
> org.apache.spark.ui.JettyUtils$.startJettyServer(JettyUtils.scala:262)
>   at org.apache.spark.ui.WebUI.bind(WebUI.scala:137)
>   at 
> org.apache.spark.SparkContext$$anonfun$13.apply(SparkContext.scala:481)
>   at 
> org.apache.spark.SparkContext$$anonfun$13.apply(SparkContext.scala:481)
>   at scala.Option.foreach(Option.scala:236)
>   at org.apache.spark.SparkContext.(SparkContext.scala:481)
>   at 
> org.apache.spark.api.java.JavaSparkContext.(JavaSparkContext.scala:59)
>
>
>

Re: Cost of converting RDD's to dataframe and back

2016-06-24 Thread Mich Talebzadeh

Hi,

I do not profess at all that this this reply has any correlation with the
advanced people :)

However, in general a Data Frame adds the two-dimensional structure (table)
to RDD which is basically a construct that cannot be optimised due to
non-schema structure of RDD.

Now converting RDD to DF will add the cost of creating that metadata. So
there is cost associated with it. However, IMO the cost is inevitable as
the later stages of the app will be much more optimised and will
compensate for that initial cost

In other words this is a necessary step and a feature of a tool, much like
creating an index in a relational table. There is a cost in creating and
maintaining a DF but the benefits outweigh the cost of metadata addition.

So it is really an academic question in overall schema of things. Can we
get away without having a DF and its offshoot of temporary table etc. ?

HTH

Dr Mich Talebzadeh

LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*

http://talebzadehmich.wordpress.com

On 24 June 2016 at 07:58, Jacek Laskowski  wrote:

> Hi,
>
> I've been asking a similar question myself too! Thanks for sending it to
> the mailing list!
>
> Going from a RDD to a Dataset triggers a job to calculate a schema (unless
> the RDD is RDD[Row]).
>
> I *think* that transitioning from a Dataset to a RDD is almost a no op
> since a Dataset requires more to generate underlying data structures and
> optimizations.
>
> Can't wait to hear what more advanced people say.
>
> Jacek
> On 24 Jun 2016 8:00 a.m., "pan"  wrote:
>
> Hello,
>I am trying to understand the cost of converting an RDD to Dataframe and
> back. Would a conversion back and forth very frequently cost performance.
>
> I do observe that some operations like join are implemented very
> differently
> for RDD (pair) and Dataframe so trying to figure out the cose of converting
> one to another
>
> Regards,
> Pranav
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Cost-of-converting-RDD-s-to-dataframe-and-back-tp27222.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>

Error Invoking Spark on Yarn on using Spark Submit

2016-06-24 Thread puneet kumar

I am getting below error thrown when I submit Spark Job using Spark Submit
on Yarn. Need a quick help on what's going wrong here.

16/06/24 01:09:25 WARN AbstractLifeCycle: FAILED
org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter-791eb5d5:
java.lang.IllegalStateException: class
org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter is not a
javax.servlet.Filter
java.lang.IllegalStateException: class
org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter is not a
javax.servlet.Filter
at 
org.spark-project.jetty.servlet.FilterHolder.doStart(FilterHolder.java:97)
at 
org.spark-project.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:64)
at 
org.spark-project.jetty.servlet.ServletHandler.initialize(ServletHandler.java:768)
at 
org.spark-project.jetty.servlet.ServletContextHandler.startContext(ServletContextHandler.java:265)
at 
org.spark-project.jetty.server.handler.ContextHandler.doStart(ContextHandler.java:717)
at 
org.spark-project.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:64)
at 
org.spark-project.jetty.server.handler.HandlerWrapper.doStart(HandlerWrapper.java:95)
at 
org.spark-project.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:64)
at 
org.spark-project.jetty.server.handler.HandlerCollection.doStart(HandlerCollection.java:229)
at 
org.spark-project.jetty.server.handler.ContextHandlerCollection.doStart(ContextHandlerCollection.java:172)
at 
org.spark-project.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:64)
at 
org.spark-project.jetty.server.handler.HandlerWrapper.doStart(HandlerWrapper.java:95)
at org.spark-project.jetty.server.Server.doStart(Server.java:282)
at 
org.spark-project.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:64)
at 
org.apache.spark.ui.JettyUtils$.org$apache$spark$ui$JettyUtils$$connect$1(JettyUtils.scala:252)
at org.apache.spark.ui.JettyUtils$$anonfun$5.apply(JettyUtils.scala:262)
at org.apache.spark.ui.JettyUtils$$anonfun$5.apply(JettyUtils.scala:262)
at 
org.apache.spark.util.Utils$$anonfun$startServiceOnPort$1.apply$mcVI$sp(Utils.scala:1988)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
at org.apache.spark.util.Utils$.startServiceOnPort(Utils.scala:1979)
at 
org.apache.spark.ui.JettyUtils$.startJettyServer(JettyUtils.scala:262)
at org.apache.spark.ui.WebUI.bind(WebUI.scala:137)
at 
org.apache.spark.SparkContext$$anonfun$13.apply(SparkContext.scala:481)
at 
org.apache.spark.SparkContext$$anonfun$13.apply(SparkContext.scala:481)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.SparkContext.(SparkContext.scala:481)
at 
org.apache.spark.api.java.JavaSparkContext.(JavaSparkContext.scala:59)

Re: Partitioning in spark

2016-06-24 Thread Darshan Singh

Thanks but the whole point is not setting it explicitly but it should be
derived from its parent RDDS.

Thanks

On Fri, Jun 24, 2016 at 6:09 AM, ayan guha  wrote:

> You can change paralllism like following:
>
> conf = SparkConf()
> conf.set('spark.sql.shuffle.partitions',10)
> sc = SparkContext(conf=conf)
>
>
>
> On Fri, Jun 24, 2016 at 6:46 AM, Darshan Singh 
> wrote:
>
>> Hi,
>>
>> My default parallelism is 100. Now I join 2 dataframes with 20 partitions
>> each , joined dataframe has 100 partition. I want to know what is the way
>> to keep it to 20 (except re-partition and coalesce.
>>
>> Also, when i join these 2 dataframes I am using 4 columns as joined
>> columns. The dataframes are partitions based on first 2 columns of join and
>> thus, in effect one partition should be joined corresponding joins and
>> doesn't need to join with rest of partitions so why spark is shuffling all
>> the data.
>>
>> Simialrly, when my dataframe is partitioned by col1,col2 and if i use
>> group by on col1,col2,col3,col4 then why does it shuffle everything whereas
>> it need to sort each partitions and then should grouping there itself.
>>
>> Bit confusing , I am using 1.5.1
>>
>> Is it fixed in future versions.
>>
>> Thanks
>>
>
>
>
> --
> Best Regards,
> Ayan Guha
>

Re: Cost of converting RDD's to dataframe and back

2016-06-24 Thread Jacek Laskowski

Hi,

I've been asking a similar question myself too! Thanks for sending it to
the mailing list!

Going from a RDD to a Dataset triggers a job to calculate a schema (unless
the RDD is RDD[Row]).

I *think* that transitioning from a Dataset to a RDD is almost a no op
since a Dataset requires more to generate underlying data structures and
optimizations.

Can't wait to hear what more advanced people say.

Jacek
On 24 Jun 2016 8:00 a.m., "pan"  wrote:

Hello,
   I am trying to understand the cost of converting an RDD to Dataframe and
back. Would a conversion back and forth very frequently cost performance.

I do observe that some operations like join are implemented very differently
for RDD (pair) and Dataframe so trying to figure out the cose of converting
one to another

Regards,
Pranav



--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Cost-of-converting-RDD-s-to-dataframe-and-back-tp27222.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

Cost of converting RDD's to dataframe and back

2016-06-24 Thread pan

Hello,
   I am trying to understand the cost of converting an RDD to Dataframe and
back. Would a conversion back and forth very frequently cost performance.

I do observe that some operations like join are implemented very differently
for RDD (pair) and Dataframe so trying to figure out the cose of converting
one to another

Regards,
Pranav



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Cost-of-converting-RDD-s-to-dataframe-and-back-tp27222.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

45 matches

Mail list logo