S3 vs HDFS

2015-07-08 Thread Brandon White
Are there any significant performance differences between reading text
files from S3 and HDFS?


Re: Is there a way to shutdown the derby in hive context in spark shell?

2015-07-08 Thread Akhil Das
Did you try stopping the context (sc.stop()) and creating a new one?

Thanks
Best Regards

On Wed, Jul 8, 2015 at 8:12 PM, Terry Hole  wrote:

> I am using spark 1.4.1rc1 with default hive settings
>
> Thanks
> - Terry
>
> Hi All,
>
> I'd like to use the Hive context in the Spark shell. I need to recreate the
> Hive meta database in the same location, so I want to close the Derby
> connection previously created in the Spark shell. Is there any way to do this?
>
> I tried this, but it does not work:
>
> DriverManager.getConnection("jdbc:derby:;shutdown=true");
>
> Thanks!
>
> - Terry
>
>
>
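
A minimal sketch of the suggestion above, assuming the goal is simply to tear
down the shell's existing context and re-create the Hive state. Note that
SparkContext exposes stop() rather than shutdown(), and whether stopping it
also releases the embedded Derby lock depends on the Hive/Derby setup, so
treat this as something to verify:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

sc.stop()  // stop the context the shell created

// Re-create the context; adjust the master setting to your environment.
val newSc = new SparkContext(new SparkConf().setAppName("fresh-hive").setMaster("local[*]"))
val newHiveContext = new HiveContext(newSc)  // re-initializes the metastore connection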


Re: Using Hive UDF in spark

2015-07-08 Thread ayan guha
You are most likely confused because you are calling the UDF through
HiveContext. In your case, you are using a Spark UDF, not a Hive UDF. For a
simple scenario, I can use Spark UDFs without any Hive installation in my
cluster.

sqlContext.udf.register is for UDFs in Spark. Hive UDFs are stored in Hive,
and you can access them through the hiveContext.sql() method (just as if you
were writing the SQL in Hive).

HTH...

Ayan

On Thu, Jul 9, 2015 at 2:33 PM, vinod kumar 
wrote:

> Hi everyone
>
> Can we use a UDF defined in Hive from Spark SQL?
>
> I've created a UDF in Spark and registered it using
> sqlContext.udf.register, but when I restarted a service the UDF was not
> available.
> I've heard that Hive UDFs are permanently stored in Hive. (Please correct
> me if I am wrong.)
>
> Thanks,
> Vinod
>



-- 
Best Regards,
Ayan Guha
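
A short sketch of the distinction described above. All names here (the
application, the `people` table, the `plusOne` UDF) are made up for
illustration; the Hive function shown is a built-in:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext(new SparkConf().setAppName("udf-example").setMaster("local[2]"))
val hiveContext = new HiveContext(sc)   // HiveContext extends SQLContext
import hiveContext.implicits._

// A tiny temp table to run the queries against.
sc.parallelize(Seq(("Ann", 30), ("Bob", 25))).toDF("name", "age").registerTempTable("people")

// A Spark UDF: tied to this SQLContext, so it must be re-registered after a restart.
hiveContext.udf.register("plusOne", (x: Int) => x + 1)
hiveContext.sql("SELECT name, plusOne(age) FROM people").show()

// A Hive UDF: Hive's built-in functions (and functions kept in the Hive
// metastore) are resolved by Hive itself and are simply referenced in the SQL.
hiveContext.sql("SELECT lower(name) FROM people").show()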


Re: PySpark without PySpark

2015-07-08 Thread Ashish Dutt
Hi Sujit,
Thanks for your response.

So I opened a new notebook using the command ipython notebook --profile
spark and tried the sequence of commands, but I am getting errors. Attached is
the screenshot of the same.
I am also attaching the 00-pyspark-setup.py for your reference. It looks
like I have written something wrong here, but I cannot seem to figure out what
it is.

Thank you for your help


Sincerely,
Ashish Dutt

On Thu, Jul 9, 2015 at 11:53 AM, Sujit Pal  wrote:

> Hi Ashish,
>
> >> Nice post.
> Agreed, kudos to the author of the post, Benjamin Benfort of District Labs.
>
> >> Following your post, I get this problem;
> Again, not my post.
>
> I did try setting up IPython with the Spark profile for the edX Intro to
> Spark course (because I didn't want to use the Vagrant container) and it
> worked flawlessly with the instructions provided (on OSX). I haven't used
> the IPython/PySpark environment beyond very basic tasks since then though,
> because my employer has a Databricks license which we were already using
> for other stuff and we ended up doing the labs on Databricks.
>
> Looking at your screenshot though, I don't see why you think it's picking
> up the default profile. One simple way of checking to see if things are
> working is to open a new notebook and try this sequence of commands:
>
> from pyspark import SparkContext
> sc = SparkContext("local", "pyspark")
> sc
>
> You should see something like this after a little while:
> 
>
> While the context is being instantiated, you should also see lots of log
> lines scroll by on the terminal where you started the "ipython notebook
> --profile spark" command - these log lines are from Spark.
>
> Hope this helps,
> Sujit
>
>
> On Wed, Jul 8, 2015 at 6:04 PM, Ashish Dutt 
> wrote:
>
>> Hi Sujit,
>> Nice post.. Exactly what I had been looking for.
>> I am relatively a beginner with Spark and real time data processing.
>> We have a server with CDH5.4 with 4 nodes. The spark version in our
>> server is 1.3.0
>> On my laptop I have spark 1.3.0 too and its using Windows 7 environment.
>> As per point 5 of your post I am able to invoke pyspark locally as in a
>> standalone mode.
>>
>> Following your post, I get this problem;
>>
>> 1. In section "Using Ipython notebook with spark" I cannot understand why
>> it is picking up the default profile and not the pyspark profile. I am sure
>> it is because of the path variables. Attached is the screenshot. Can you
>> suggest how to solve this.
>>
>> Current the path variables for my laptop are like
>> SPARK_HOME="C:\SPARK-1.3.0\BIN", JAVA_HOME="C:\PROGRAM
>> FILES\JAVA\JDK1.7.0_79", HADOOP_HOME="D:\WINUTILS", M2_HOME="D:\MAVEN\BIN",
>> MAVEN_HOME="D:\MAVEN\BIN", PYTHON_HOME="C:\PYTHON27\", SBT_HOME="C:\SBT\"
>>
>>
>> Sincerely,
>> Ashish Dutt
>> PhD Candidate
>> Department of Information Systems
>> University of Malaya, Lembah Pantai,
>> 50603 Kuala Lumpur, Malaysia
>>
>> On Thu, Jul 9, 2015 at 4:56 AM, Sujit Pal  wrote:
>>
>>> You are welcome Davies. Just to clarify, I didn't write the post (not
>>> sure if my earlier post gave that impression, apologize if so), although I
>>> agree its great :-).
>>>
>>> -sujit
>>>
>>>
>>> On Wed, Jul 8, 2015 at 10:36 AM, Davies Liu 
>>> wrote:
>>>
 Great post, thanks for sharing with us!

 On Wed, Jul 8, 2015 at 9:59 AM, Sujit Pal 
 wrote:
 > Hi Julian,
 >
 > I recently built a Python+Spark application to do search relevance
 > analytics. I use spark-submit to submit PySpark jobs to a Spark
 cluster on
 > EC2 (so I don't use the PySpark shell, hopefully thats what you are
 looking
 > for). Can't share the code, but the basic approach is covered in this
 blog
 > post - scroll down to the section "Writing a Spark Application".
 >
 >
 https://districtdatalabs.silvrback.com/getting-started-with-spark-in-python
 >
 > Hope this helps,
 >
 > -sujit
 >
 >
 > On Wed, Jul 8, 2015 at 7:46 AM, Julian 
 wrote:
 >>
 >> Hey.
 >>
 >> Is there a resource that has written up what the necessary steps are
 for
 >> running PySpark without using the PySpark shell?
 >>
 >> I can reverse engineer (by following the tracebacks and reading the
 shell
 >> source) what the relevant Java imports needed are, but I would assume
 >> someone has attempted this before and just published something I can
 >> either
 >> follow or install? If not, I have something that pretty much works
 and can
 >> publish it, but I'm not a heavy Spark user, so there may be some
 things
 >> I've
 >> left out that I haven't hit because of how little of pyspark I'm
 playing
 >> with.
 >>
 >> Thanks,
 >> Julian
 >>
 >>
 >>
 >> --
 >> View this message in context:
 >>
 http://apache-spark-user-list.1001560.n3.nabble.com/PySpark-without-PySpark-tp23719.html
 >> Sent from the Apache Spark User List mailing list archive at
>>>

Re: Spark query

2015-07-08 Thread Brandon White
Convert the column to a column of java Timestamps. Then you can do the
following

import java.sql.Timestamp
import java.util.Calendar
def date_trunc(timestamp:Timestamp, timeField:String) = {
  timeField match {
case "hour" =>
  val cal = Calendar.getInstance()
  cal.setTimeInMillis(timestamp.getTime())
  cal.get(Calendar.HOUR_OF_DAY)

case "day" =>
  val cal = Calendar.getInstance()
  cal.setTimeInMillis(timestamp.getTime())
  cal.get(Calendar.DAY_OF_YEAR) // java.util.Calendar has no DAY field; DAY_OF_YEAR gives the day of year
  }
}

sqlContext.udf.register("date_trunc", date_trunc _)

On Wed, Jul 8, 2015 at 9:23 PM, Harish Butani 
wrote:

> try the spark-datetime package:
> https://github.com/SparklineData/spark-datetime
> Follow this example
> https://github.com/SparklineData/spark-datetime#a-basic-example to get
> the different attributes of a DateTime.
>
> On Wed, Jul 8, 2015 at 9:11 PM, prosp4300  wrote:
>
>> As mentioned in Spark sQL programming guide, Spark SQL support Hive UDFs,
>> please take a look below builtin UDFs of Hive, get day of year should be as
>> simply as existing RDBMS
>>
>> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-DateFunctions
>>
>>
>> At 2015-07-09 12:02:44, "Ravisankar Mani"  wrote:
>>
>> Hi everyone,
>>
>> I can't get 'day of year'  when using spark query. Can you help any way
>> to achieve day of year?
>>
>> Regards,
>> Ravi
>>
>>
>>
>>
>
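
A hypothetical usage of the UDF registered above, assuming a DataFrame `df`
with a java.sql.Timestamp column named `ts` registered as a temp table
`events` (both names are made up):

// Registered UDFs can be called directly from SQL on the same context.
df.registerTempTable("events")
sqlContext.sql("SELECT ts, date_trunc(ts, 'day') AS day_of_year FROM events").show()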


Using Hive UDF in spark

2015-07-08 Thread vinod kumar
Hi everyone

Can we use a UDF defined in Hive from Spark SQL?

I've created a UDF in Spark and registered it using
sqlContext.udf.register, but when I restarted a service the UDF was not
available.
I've heard that Hive UDFs are permanently stored in Hive. (Please correct
me if I am wrong.)

Thanks,
Vinod


Re: Writing data to hbase using Sparkstreaming

2015-07-08 Thread Ted Yu
bq. return new Tuple2<ImmutableBytesWritable, Put>(new ImmutableBytesWritable(), put);

I don't think Put is serializable.

FYI

On Fri, Jun 12, 2015 at 6:40 AM, Vamshi Krishna 
wrote:

> Hi, I am trying to write data that is produced by the Kafka command-line
> producer for some topic. I am facing a problem and unable to proceed. Below
> is my code, which I package into a jar and run through spark-submit. Am I
> doing something wrong inside foreachRDD()? What is wrong with the
> SparkKafkaDemo$2.call(SparkKafkaDemo.java:63) line in the error
> message below?
>
>
>
> SparkConf sparkConf = new SparkConf()
>     .setAppName("JavaKafkaDemo")
>     .setMaster("local")
>     .setSparkHome("/Users/kvk/softwares/spark-1.3.1-bin-hadoop2.4");
> // Create the context with a 5 second batch size
> JavaStreamingContext jsc = new JavaStreamingContext(sparkConf, new Duration(5000));
>
> int numThreads = 2;
> Map<String, Integer> topicMap = new HashMap<String, Integer>();
> // topicMap.put("viewTopic", numThreads);
> topicMap.put("nonview", numThreads);
>
> JavaPairReceiverInputDStream<String, String> messages =
>     KafkaUtils.createStream(jsc, "localhost", "ViewConsumer", topicMap);
>
> JavaDStream<String> lines = messages.map(
>     new Function<Tuple2<String, String>, String>() {
>         @Override
>         public String call(Tuple2<String, String> tuple2) {
>             return tuple2._2();
>         }
>     });
>
> lines.foreachRDD(new Function<JavaRDD<String>, Void>() {
>     @Override
>     public Void call(JavaRDD<String> stringJavaRDD) throws Exception {
>
>         JavaPairRDD<ImmutableBytesWritable, Put> hbasePuts = stringJavaRDD.mapToPair(
>             new PairFunction<String, ImmutableBytesWritable, Put>() {
>                 @Override
>                 public Tuple2<ImmutableBytesWritable, Put> call(String line) throws Exception {
>
>                     Put put = new Put(Bytes.toBytes("Rowkey" + Math.random()));
>                     put.addColumn(Bytes.toBytes("firstFamily"), Bytes.toBytes("firstColumn"),
>                         Bytes.toBytes(line + "fc"));
>                     return new Tuple2<ImmutableBytesWritable, Put>(new ImmutableBytesWritable(), put);
>                 }
>             });
>
>         // save to HBase - Spark built-in API method
>         hbasePuts.saveAsNewAPIHadoopDataset(newAPIJobConfiguration1.getConfiguration());
>         return null;
>     }
> });
> jsc.start();
> jsc.awaitTermination();
>
>
>
>
>
> I see below error on spark-shell.
>
>
> ./bin/spark-submit --class "SparkKafkaDemo" --master local
> /Users/kvk/IntelliJWorkspace/HbaseDemo/HbaseDemo.jar
>
> Exception in thread "main" org.apache.spark.SparkException: Task not
> serializable
>
> at
> org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:166)
>
> at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158)
>
> at org.apache.spark.SparkContext.clean(SparkContext.scala:1623)
>
> at org.apache.spark.rdd.RDD.map(RDD.scala:286)
>
> at
> org.apache.spark.api.java.JavaRDDLike$class.mapToPair(JavaRDDLike.scala:113)
>
> at
> org.apache.spark.api.java.AbstractJavaRDDLike.mapToPair(JavaRDDLike.scala:46)
>
> at SparkKafkaDemo$2.call(SparkKafkaDemo.java:63)
>
> at SparkKafkaDemo$2.call(SparkKafkaDemo.java:60)
>
> at
> org.apache.spark.streaming.api.java.JavaDStreamLike$$anonfun$foreachRDD$1.apply(JavaDStreamLike.scala:311)
>
> at
> org.apache.spark.streaming.api.java.JavaDStreamLike$$anonfun$foreachRDD$1.apply(JavaDStreamLike.scala:311)
>
> at
> org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1.apply(DStream.scala:534)
>
> at
> org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1.apply(DStream.scala:534)
>
> at
> org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:42)
>
> at
> org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
>
> at
> org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
>
> at scala.util.Try$.apply(Try.scala:161)
>
> at org.apache.spark.streaming.scheduler.Job.run(Job.scala:32)
>
> at
> org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply$mcV$sp(JobScheduler.scala:176)
>
> at
> org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:176)
>
> at
> org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:176)
>
> at scala.util.
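
A common cause of this "Task not serializable" error in Java code like the
above is an anonymous inner class holding a reference to a non-serializable
enclosing object; Ted's point about Put is another possibility. A minimal
Scala sketch of the same write path, where everything the shipped closure
captures is serializable and the Put is only ever created on the executors
(the `jobConf` parameter stands in for the original `newAPIJobConfiguration1`,
which is not shown in the thread):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.streaming.dstream.DStream

def writeToHBase(lines: DStream[String], jobConf: Configuration): Unit = {
  lines.foreachRDD { rdd =>                 // runs on the driver, once per batch
    val puts = rdd.map { line =>
      // The Put is built inside the task, so it is never serialized as part
      // of the closure shipped from the driver.
      val put = new Put(Bytes.toBytes("Rowkey" + math.random))
      put.addColumn(Bytes.toBytes("firstFamily"), Bytes.toBytes("firstColumn"),
        Bytes.toBytes(line + "fc"))
      (new ImmutableBytesWritable(), put)
    }
    puts.saveAsNewAPIHadoopDataset(jobConf)
  }
}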

Re: Spark query

2015-07-08 Thread Harish Butani
try the spark-datetime package:
https://github.com/SparklineData/spark-datetime
Follow this example
https://github.com/SparklineData/spark-datetime#a-basic-example to get the
different attributes of a DateTime.

On Wed, Jul 8, 2015 at 9:11 PM, prosp4300  wrote:

> As mentioned in Spark sQL programming guide, Spark SQL support Hive UDFs,
> please take a look below builtin UDFs of Hive, get day of year should be as
> simply as existing RDBMS
>
> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-DateFunctions
>
>
> At 2015-07-09 12:02:44, "Ravisankar Mani"  wrote:
>
> Hi everyone,
>
> I can't get 'day of year'  when using spark query. Can you help any way to
> achieve day of year?
>
> Regards,
> Ravi
>
>
>
>


SparkR dataFrame read.df fails to read from aws s3

2015-07-08 Thread Ben Spark
I have Spark 1.4 deployed on AWS EMR, but the SparkR dataFrame read.df
method cannot load data from AWS S3.
1) "read.df" error message
read.df(sqlContext,"s3://some-bucket/some.json","json")
15/07/09 04:07:01 ERROR r.RBackendHandler: loadDF on 
org.apache.spark.sql.api.r.SQLUtils failed
java.lang.IllegalArgumentException: invalid method loadDF for object 
org.apache.spark.sql.api.r.SQLUtils
at 
org.apache.spark.api.r.RBackendHandler.handleMethodCall(RBackendHandler.scala:143)
at 
org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:74)
at 
org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:36)  
at 
io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
2) "jsonFile" is working though, with some warning message:
Warning message:
In normalizePath(path) :
  path[1]="s3://rea-consumer-data-dev/cbr/profiler/output/20150618/part-0": 
No such file or directory

Re:Spark query

2015-07-08 Thread prosp4300
As mentioned in the Spark SQL programming guide, Spark SQL supports Hive UDFs.
Please take a look at the built-in UDFs of Hive below; getting day of year should be as
simple as in an existing RDBMS.

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-DateFunctions




At 2015-07-09 12:02:44, "Ravisankar Mani"  wrote:

Hi everyone,


I can't get 'day of year'  when using spark query. Can you help any way to 
achieve day of year?


Regards,

Ravi
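
A small sketch of the approach suggested above, issuing Hive's built-in date
functions through a HiveContext. The `events` table and its timestamp column
`ts` are hypothetical, and in case the Hive version in use has no single
day-of-year built-in, the value is composed from datediff/to_date/year:

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)  // sc: an existing SparkContext

hiveContext.sql(
  """SELECT ts,
    |       datediff(to_date(ts), concat(cast(year(ts) AS STRING), '-01-01')) + 1 AS day_of_year
    |  FROM events""".stripMargin
).show()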

Re: PySpark without PySpark

2015-07-08 Thread Bhupendra Mishra
Very interesting and well organized post. Thanks for sharing

On Wed, Jul 8, 2015 at 10:29 PM, Sujit Pal  wrote:

> Hi Julian,
>
> I recently built a Python+Spark application to do search relevance
> analytics. I use spark-submit to submit PySpark jobs to a Spark cluster on
> EC2 (so I don't use the PySpark shell, hopefully thats what you are looking
> for). Can't share the code, but the basic approach is covered in this blog
> post - scroll down to the section "Writing a Spark Application".
>
> https://districtdatalabs.silvrback.com/getting-started-with-spark-in-python
>
> Hope this helps,
>
> -sujit
>
>
> On Wed, Jul 8, 2015 at 7:46 AM, Julian  wrote:
>
>> Hey.
>>
>> Is there a resource that has written up what the necessary steps are for
>> running PySpark without using the PySpark shell?
>>
>> I can reverse engineer (by following the tracebacks and reading the shell
>> source) what the relevant Java imports needed are, but I would assume
>> someone has attempted this before and just published something I can
>> either
>> follow or install? If not, I have something that pretty much works and can
>> publish it, but I'm not a heavy Spark user, so there may be some things
>> I've
>> left out that I haven't hit because of how little of pyspark I'm playing
>> with.
>>
>> Thanks,
>> Julian
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/PySpark-without-PySpark-tp23719.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>


Re: Re: how to use DoubleRDDFunctions on mllib Vector?

2015-07-08 Thread prosp4300



It seems what Feynman mentioned is the source code rather than the documentation;
vectorMean is private, see
https://github.com/apache/spark/blob/v1.3.0/mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixture.scala


At 2015-07-09 10:10:58, "诺铁"  wrote:

thanks, I understand now.
but I can't find mllib.clustering.GaussianMixture#vectorMean   , what version 
of spark do you use?


On Thu, Jul 9, 2015 at 1:16 AM, Feynman Liang  wrote:

A RDD[Double] is an abstraction for a large collection of doubles, possibly 
distributed across multiple nodes. The DoubleRDDFunctions are there for 
performing mean and variance calculations across this distributed dataset.


In contrast, a Vector is not distributed and fits on your local machine. You 
would be better off computing these quantities on the Vector directly (see 
mllib.clustering.GaussianMixture#vectorMean for an example of how to compute 
the mean of a vector).


On Tue, Jul 7, 2015 at 8:26 PM, 诺铁  wrote:

hi,


there are some useful functions in DoubleRDDFunctions, which I can use if I 
have RDD[Double], eg, mean, variance.  


Vector doesn't have such methods, how can I convert Vector to RDD[Double], or 
maybe better if I can call mean directly on a Vector?





Spark query

2015-07-08 Thread Ravisankar Mani
Hi everyone,

I can't get 'day of year' when using a Spark query. Can you help with any way to
achieve day of year?

Regards,
Ravi


Re: Spark program throws NIO Buffer over flow error (TDigest - Ted Dunning lib)

2015-07-08 Thread Ted Yu
Doesn't seem to be a Spark problem, assuming TDigest comes from Mahout.

Cheers

On Wed, Jul 8, 2015 at 7:49 PM, ÐΞ€ρ@Ҝ (๏̯͡๏)  wrote:

> Same exception with different values of compression (10,100)
>
>   var digest: TDigest = TDigest.createAvlTreeDigest(100)
>
> On Wed, Jul 8, 2015 at 6:50 PM, ÐΞ€ρ@Ҝ (๏̯͡๏)  wrote:
>
>> Any suggestions ?
>> Code:
>>
>> val dimQuantiles = genericRecordsAndKeys
>>
>>   .map {
>>
>> case (keyToOutput, rec) =>
>>
>>   var digest: TDigest = TDigest.createAvlTreeDigest(1)
>>
>>   val fpPaidGMB = rec.get("fpPaidGMB").asInstanceOf[Double]
>>
>>   digest.add(fpPaidGMB)
>>
>>   var bbuf: ByteBuffer = ByteBuffer.allocate(digest.byteSize());
>>
>>   digest.asBytes(bbuf);
>>
>>   (keyToOutput.toString, bbuf.array())
>>
>>   }
>>
>>
>> Trace:
>> aborted due to stage failure: Task 4487 in stage 1.0 failed 4 times, most
>> recent failure: Lost task 4487.3 in stage 1.0 (TID 281,
>> phxaishdc9dn0111.phx.ebay.com): java.nio.BufferOverflowException
>> at java.nio.Buffer.nextPutIndex(Buffer.java:519)
>> at java.nio.HeapByteBuffer.putDouble(HeapByteBuffer.java:519)
>> at com.tdunning.math.stats.AVLTreeDigest.asBytes(AVLTreeDigest.java:336)
>> at
>> com.ebay.ep.poc.spark.reporting.process.service.OutlierService$$anonfun$4.apply(OutlierService.scala:96)
>> at
>> com.ebay.ep.poc.spark.reporting.process.service.OutlierService$$anonfun$4.apply(OutlierService.scala:89)
>> at
>> org.apache.spark.util.collection.ExternalSorter$$anonfun$5.apply(ExternalSorter.scala:197)
>> at
>> org.apache.spark.util.collection.ExternalSorter$$anonfun$5.apply(ExternalSorter.scala:196)
>> at
>> org.apache.spark.util.collection.AppendOnlyMap.changeValue(AppendOnlyMap.scala:138)
>> at
>> org.apache.spark.util.collection.SizeTrackingAppendOnlyMap.changeValue(SizeTrackingAppendOnlyMap.scala:32)
>> at
>> org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:202)
>> at
>> org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:56)
>> at
>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
>> at
>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>>
>>
>> --
>> Deepak
>>
>>
>
>
> --
> Deepak
>
>


Re: PySpark without PySpark

2015-07-08 Thread Sujit Pal
Hi Ashish,

>> Nice post.
Agreed, kudos to the author of the post, Benjamin Benfort of District Labs.

>> Following your post, I get this problem;
Again, not my post.

I did try setting up IPython with the Spark profile for the edX Intro to
Spark course (because I didn't want to use the Vagrant container) and it
worked flawlessly with the instructions provided (on OSX). I haven't used
the IPython/PySpark environment beyond very basic tasks since then though,
because my employer has a Databricks license which we were already using
for other stuff and we ended up doing the labs on Databricks.

Looking at your screenshot though, I don't see why you think it's picking up
the default profile. One simple way of checking to see if things are
working is to open a new notebook and try this sequence of commands:

from pyspark import SparkContext
sc = SparkContext("local", "pyspark")
sc

You should see something like this after a little while:


While the context is being instantiated, you should also see lots of log
lines scroll by on the terminal where you started the "ipython notebook
--profile spark" command - these log lines are from Spark.

Hope this helps,
Sujit


On Wed, Jul 8, 2015 at 6:04 PM, Ashish Dutt  wrote:

> Hi Sujit,
> Nice post.. Exactly what I had been looking for.
> I am relatively a beginner with Spark and real time data processing.
> We have a server with CDH5.4 with 4 nodes. The spark version in our server
> is 1.3.0
> On my laptop I have spark 1.3.0 too and its using Windows 7 environment.
> As per point 5 of your post I am able to invoke pyspark locally as in a
> standalone mode.
>
> Following your post, I get this problem;
>
> 1. In section "Using Ipython notebook with spark" I cannot understand why
> it is picking up the default profile and not the pyspark profile. I am sure
> it is because of the path variables. Attached is the screenshot. Can you
> suggest how to solve this.
>
> Current the path variables for my laptop are like
> SPARK_HOME="C:\SPARK-1.3.0\BIN", JAVA_HOME="C:\PROGRAM
> FILES\JAVA\JDK1.7.0_79", HADOOP_HOME="D:\WINUTILS", M2_HOME="D:\MAVEN\BIN",
> MAVEN_HOME="D:\MAVEN\BIN", PYTHON_HOME="C:\PYTHON27\", SBT_HOME="C:\SBT\"
>
>
> Sincerely,
> Ashish Dutt
> PhD Candidate
> Department of Information Systems
> University of Malaya, Lembah Pantai,
> 50603 Kuala Lumpur, Malaysia
>
> On Thu, Jul 9, 2015 at 4:56 AM, Sujit Pal  wrote:
>
>> You are welcome Davies. Just to clarify, I didn't write the post (not
>> sure if my earlier post gave that impression, apologize if so), although I
>> agree its great :-).
>>
>> -sujit
>>
>>
>> On Wed, Jul 8, 2015 at 10:36 AM, Davies Liu 
>> wrote:
>>
>>> Great post, thanks for sharing with us!
>>>
>>> On Wed, Jul 8, 2015 at 9:59 AM, Sujit Pal 
>>> wrote:
>>> > Hi Julian,
>>> >
>>> > I recently built a Python+Spark application to do search relevance
>>> > analytics. I use spark-submit to submit PySpark jobs to a Spark
>>> cluster on
>>> > EC2 (so I don't use the PySpark shell, hopefully thats what you are
>>> looking
>>> > for). Can't share the code, but the basic approach is covered in this
>>> blog
>>> > post - scroll down to the section "Writing a Spark Application".
>>> >
>>> >
>>> https://districtdatalabs.silvrback.com/getting-started-with-spark-in-python
>>> >
>>> > Hope this helps,
>>> >
>>> > -sujit
>>> >
>>> >
>>> > On Wed, Jul 8, 2015 at 7:46 AM, Julian 
>>> wrote:
>>> >>
>>> >> Hey.
>>> >>
>>> >> Is there a resource that has written up what the necessary steps are
>>> for
>>> >> running PySpark without using the PySpark shell?
>>> >>
>>> >> I can reverse engineer (by following the tracebacks and reading the
>>> shell
>>> >> source) what the relevant Java imports needed are, but I would assume
>>> >> someone has attempted this before and just published something I can
>>> >> either
>>> >> follow or install? If not, I have something that pretty much works
>>> and can
>>> >> publish it, but I'm not a heavy Spark user, so there may be some
>>> things
>>> >> I've
>>> >> left out that I haven't hit because of how little of pyspark I'm
>>> playing
>>> >> with.
>>> >>
>>> >> Thanks,
>>> >> Julian
>>> >>
>>> >>
>>> >>
>>> >> --
>>> >> View this message in context:
>>> >>
>>> http://apache-spark-user-list.1001560.n3.nabble.com/PySpark-without-PySpark-tp23719.html
>>> >> Sent from the Apache Spark User List mailing list archive at
>>> Nabble.com.
>>> >>
>>> >> -
>>> >> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> >> For additional commands, e-mail: user-h...@spark.apache.org
>>> >>
>>> >
>>>
>>
>>
>


Re: RDD saveAsTextFile() to local disk

2015-07-08 Thread Vijay Pawnarkar
Thanks for the help.

Following are the folders I was trying to write to

saveAsTextFile("file:///home/someuser/test2/testupload/20150708/0/")

saveAsTextFile("file:///home/someuser/test2/testupload/20150708/1/")

saveAsTextFile("file:///home/someuser/test2/testupload/20150708/2/")

saveAsTextFile("file:///home/someuser/test2/testupload/20150708/3/")


The folder name "test2" was causing the issue; for whatever reason the API
does not recognize file:///home/someuser/test2 as a directory.

Once the folder name was changed to file:///home/someuser/batch/testupload/20150708/0/,
it has been working well. I am able to reproduce the issue consistently with
the folder name "test2".







On Jul 8, 2015 8:31 PM, "canan chen"  wrote:

> It works for me by using the following code. Could you share your code ?
>
>
> val data = sc.parallelize(List(1,2,3))
> data.saveAsTextFile("file:Users/chen/Temp/c")
>
> On Thu, Jul 9, 2015 at 4:05 AM, spok20nn  wrote:
>
>> Getting exception when wrting RDD to local disk using following function
>>
>>  saveAsTextFile("file:home/someuser/dir2/testupload/20150708/")
>>
>> The dir (/home/someuser/dir2/testupload/) was created before running the
>> job. The error message is misleading.
>>
>>
>> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0
>> in
>> stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0
>> (TID 6, xxx.yyy.com): org.apache.hadoop.fs.ParentNotDirectoryException:
>> Parent path is not a directory: file:/home/someuser/dir2
>> at
>>
>> org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:418)
>> at
>>
>> org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:426)
>> at
>>
>> org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:426)
>> at
>>
>> org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:426)
>> at
>>
>> org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:426)
>> at
>>
>> org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:426)
>> at
>>
>> org.apache.hadoop.fs.ChecksumFileSystem.mkdirs(ChecksumFileSystem.java:588)
>> at
>>
>> org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:439)
>> at
>>
>> org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:426)
>> at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:906)
>> at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:799)
>> at
>>
>> org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputFormat.java:123)
>> at
>> org.apache.spark.SparkHadoopWriter.open(SparkHadoopWriter.scala:90)
>> at
>>
>> org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1060)
>> at
>>
>> org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1051)
>> at
>> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
>> at org.apache.spark.scheduler.Task.run(Task.scala:56)
>> at
>> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:200)
>> at
>>
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>> at
>>
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>> at java.lang.Thread.run(Thread.java:745)
>>
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/RDD-saveAsTextFile-to-local-disk-tp23725.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>


DLL load failed: %1 is not a valid win32 application on invoking pyspark

2015-07-08 Thread ashishdutt
Hi,

I get the error, "DLL load failed: %1 is not a valid win32 application"
whenever I invoke pyspark. Attached is the screenshot of the same.
Is there any way I can get rid of it? Still being new to PySpark, I have
had a not-so-pleasant experience so far, most probably because I am on a
Windows environment.
Therefore, I am afraid that this error might cause me trouble as I continue
my journey exploring pyspark.
I have already checked SO and this user list, but there are no posts for the
same.

My environment is python version 2.7, OS- Windows 7, Spark- ver 1.3.0
Appreciate your help.

 
Sincerely,
Ashish Dutt
(Attachment: error.PNG)



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/DLL-load-failed-1-is-not-a-valid-win32-application-on-invoking-pyspark-tp23733.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



DLL load failed: %1 is not a valid win32 application on invoking pyspark

2015-07-08 Thread Ashish Dutt
Hi,

I get the error, "DLL load failed: %1 is not a valid win32 application"
whenever I invoke pyspark. Attached is the screenshot of the same.
Is there any way I can get rid of it? Still being new to PySpark, I have
had a not-so-pleasant experience so far, most probably because I am on a
Windows environment.
Therefore, I am afraid that this error might cause me trouble as I continue
my journey exploring pyspark.
I have already checked SO and this user list, but there are no posts for the
same.

My environment is python version 2.7, OS- Windows 7, Spark- ver 1.3.0
Appreciate your help.


Sincerely,
Ashish Dutt

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

Re: Spark program throws NIO Buffer over flow error (TDigest - Ted Dunning lib)

2015-07-08 Thread ๏̯͡๏
Same exception with different values of compression (10,100)

  var digest: TDigest = TDigest.createAvlTreeDigest(100)

On Wed, Jul 8, 2015 at 6:50 PM, ÐΞ€ρ@Ҝ (๏̯͡๏)  wrote:

> Any suggestions ?
> Code:
>
> val dimQuantiles = genericRecordsAndKeys
>
>   .map {
>
> case (keyToOutput, rec) =>
>
>   var digest: TDigest = TDigest.createAvlTreeDigest(1)
>
>   val fpPaidGMB = rec.get("fpPaidGMB").asInstanceOf[Double]
>
>   digest.add(fpPaidGMB)
>
>   var bbuf: ByteBuffer = ByteBuffer.allocate(digest.byteSize());
>
>   digest.asBytes(bbuf);
>
>   (keyToOutput.toString, bbuf.array())
>
>   }
>
>
> Trace:
> aborted due to stage failure: Task 4487 in stage 1.0 failed 4 times, most
> recent failure: Lost task 4487.3 in stage 1.0 (TID 281,
> phxaishdc9dn0111.phx.ebay.com): java.nio.BufferOverflowException
> at java.nio.Buffer.nextPutIndex(Buffer.java:519)
> at java.nio.HeapByteBuffer.putDouble(HeapByteBuffer.java:519)
> at com.tdunning.math.stats.AVLTreeDigest.asBytes(AVLTreeDigest.java:336)
> at
> com.ebay.ep.poc.spark.reporting.process.service.OutlierService$$anonfun$4.apply(OutlierService.scala:96)
> at
> com.ebay.ep.poc.spark.reporting.process.service.OutlierService$$anonfun$4.apply(OutlierService.scala:89)
> at
> org.apache.spark.util.collection.ExternalSorter$$anonfun$5.apply(ExternalSorter.scala:197)
> at
> org.apache.spark.util.collection.ExternalSorter$$anonfun$5.apply(ExternalSorter.scala:196)
> at
> org.apache.spark.util.collection.AppendOnlyMap.changeValue(AppendOnlyMap.scala:138)
> at
> org.apache.spark.util.collection.SizeTrackingAppendOnlyMap.changeValue(SizeTrackingAppendOnlyMap.scala:32)
> at
> org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:202)
> at
> org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:56)
> at
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
> at
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>
>
> --
> Deepak
>
>


-- 
Deepak


Re: how to use DoubleRDDFunctions on mllib Vector?

2015-07-08 Thread 诺铁
Thanks, I understand now.
But I can't find mllib.clustering.GaussianMixture#vectorMean; what
version of Spark do you use?

On Thu, Jul 9, 2015 at 1:16 AM, Feynman Liang  wrote:

> A RDD[Double] is an abstraction for a large collection of doubles,
> possibly distributed across multiple nodes. The DoubleRDDFunctions are
> there for performing mean and variance calculations across this distributed
> dataset.
>
> In contrast, a Vector is not distributed and fits on your local machine.
> You would be better off computing these quantities on the Vector directly
> (see mllib.clustering.GaussianMixture#vectorMean for an example of how to
> compute the mean of a vector).
>
> On Tue, Jul 7, 2015 at 8:26 PM, 诺铁  wrote:
>
>> hi,
>>
>> there are some useful functions in DoubleRDDFunctions, which I can use if
>> I have RDD[Double], eg, mean, variance.
>>
>> Vector doesn't have such methods, how can I convert Vector to
>> RDD[Double], or maybe better if I can call mean directly on a Vector?
>>
>
>
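
A small sketch of the approach Feynman describes: compute the statistics
locally on the Vector (or, if the RDD API is really wanted, parallelize the
values). Only public APIs are used here, not the private vectorMean helper;
the sample values are arbitrary:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.Vectors

val sc = new SparkContext(new SparkConf().setAppName("vector-stats").setMaster("local[2]"))
val v = Vectors.dense(1.0, 2.0, 3.0, 4.0)

// Local computation: a Vector fits in memory, so plain Scala is enough.
val values = v.toArray
val mean = values.sum / values.length
val variance = values.map(x => math.pow(x - mean, 2)).sum / values.length

// Alternatively, turn it into an RDD[Double] to reuse DoubleRDDFunctions,
// though this is overkill for a single local vector.
val rdd = sc.parallelize(values)
println(s"mean=${rdd.mean()}, variance=${rdd.variance()}")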


Error while taking union

2015-07-08 Thread anshu shukla
Hi all,

I want to create a union of 2 DStreams: in one of them an RDD is created
per 1 second; the other has RDDs generated by reduceByKeyAndWindow
with the window duration set to 60 sec (slide duration also 60 sec).

- The main idea is to do some analysis on every minute of data and emit the
union of the input data (per sec.) and the transformed data (per min.).


Spark program throws NIO Buffer over flow error (TDigest - Ted Dunning lib)

2015-07-08 Thread ๏̯͡๏
Any suggestions ?
Code:

val dimQuantiles = genericRecordsAndKeys

  .map {

case (keyToOutput, rec) =>

  var digest: TDigest = TDigest.createAvlTreeDigest(1)

  val fpPaidGMB = rec.get("fpPaidGMB").asInstanceOf[Double]

  digest.add(fpPaidGMB)

  var bbuf: ByteBuffer = ByteBuffer.allocate(digest.byteSize());

  digest.asBytes(bbuf);

  (keyToOutput.toString, bbuf.array())

  }


Trace:
aborted due to stage failure: Task 4487 in stage 1.0 failed 4 times, most
recent failure: Lost task 4487.3 in stage 1.0 (TID 281,
phxaishdc9dn0111.phx.ebay.com): java.nio.BufferOverflowException
at java.nio.Buffer.nextPutIndex(Buffer.java:519)
at java.nio.HeapByteBuffer.putDouble(HeapByteBuffer.java:519)
at com.tdunning.math.stats.AVLTreeDigest.asBytes(AVLTreeDigest.java:336)
at
com.ebay.ep.poc.spark.reporting.process.service.OutlierService$$anonfun$4.apply(OutlierService.scala:96)
at
com.ebay.ep.poc.spark.reporting.process.service.OutlierService$$anonfun$4.apply(OutlierService.scala:89)
at
org.apache.spark.util.collection.ExternalSorter$$anonfun$5.apply(ExternalSorter.scala:197)
at
org.apache.spark.util.collection.ExternalSorter$$anonfun$5.apply(ExternalSorter.scala:196)
at
org.apache.spark.util.collection.AppendOnlyMap.changeValue(AppendOnlyMap.scala:138)
at
org.apache.spark.util.collection.SizeTrackingAppendOnlyMap.changeValue(SizeTrackingAppendOnlyMap.scala:32)
at
org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:202)
at
org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:56)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)


-- 
Deepak


Re: PySpark without PySpark

2015-07-08 Thread Ashish Dutt
Hi Sujit,
Nice post.. Exactly what I had been looking for.
I am relatively new to Spark and real-time data processing.
We have a server with CDH5.4 with 4 nodes. The Spark version on our server
is 1.3.0.
On my laptop I have Spark 1.3.0 too, and it's using a Windows 7 environment. As
per point 5 of your post I am able to invoke pyspark locally as in
standalone mode.

Following your post, I get this problem;

1. In section "Using Ipython notebook with spark" I cannot understand why
it is picking up the default profile and not the pyspark profile. I am sure
it is because of the path variables. Attached is the screenshot. Can you
suggest how to solve this.

Currently, the path variables on my laptop are:
SPARK_HOME="C:\SPARK-1.3.0\BIN", JAVA_HOME="C:\PROGRAM
FILES\JAVA\JDK1.7.0_79", HADOOP_HOME="D:\WINUTILS", M2_HOME="D:\MAVEN\BIN",
MAVEN_HOME="D:\MAVEN\BIN", PYTHON_HOME="C:\PYTHON27\", SBT_HOME="C:\SBT\"


Sincerely,
Ashish Dutt
PhD Candidate
Department of Information Systems
University of Malaya, Lembah Pantai,
50603 Kuala Lumpur, Malaysia

On Thu, Jul 9, 2015 at 4:56 AM, Sujit Pal  wrote:

> You are welcome Davies. Just to clarify, I didn't write the post (not sure
> if my earlier post gave that impression, apologize if so), although I agree
> its great :-).
>
> -sujit
>
>
> On Wed, Jul 8, 2015 at 10:36 AM, Davies Liu  wrote:
>
>> Great post, thanks for sharing with us!
>>
>> On Wed, Jul 8, 2015 at 9:59 AM, Sujit Pal  wrote:
>> > Hi Julian,
>> >
>> > I recently built a Python+Spark application to do search relevance
>> > analytics. I use spark-submit to submit PySpark jobs to a Spark cluster
>> on
>> > EC2 (so I don't use the PySpark shell, hopefully thats what you are
>> looking
>> > for). Can't share the code, but the basic approach is covered in this
>> blog
>> > post - scroll down to the section "Writing a Spark Application".
>> >
>> >
>> https://districtdatalabs.silvrback.com/getting-started-with-spark-in-python
>> >
>> > Hope this helps,
>> >
>> > -sujit
>> >
>> >
>> > On Wed, Jul 8, 2015 at 7:46 AM, Julian 
>> wrote:
>> >>
>> >> Hey.
>> >>
>> >> Is there a resource that has written up what the necessary steps are
>> for
>> >> running PySpark without using the PySpark shell?
>> >>
>> >> I can reverse engineer (by following the tracebacks and reading the
>> shell
>> >> source) what the relevant Java imports needed are, but I would assume
>> >> someone has attempted this before and just published something I can
>> >> either
>> >> follow or install? If not, I have something that pretty much works and
>> can
>> >> publish it, but I'm not a heavy Spark user, so there may be some things
>> >> I've
>> >> left out that I haven't hit because of how little of pyspark I'm
>> playing
>> >> with.
>> >>
>> >> Thanks,
>> >> Julian
>> >>
>> >>
>> >>
>> >> --
>> >> View this message in context:
>> >>
>> http://apache-spark-user-list.1001560.n3.nabble.com/PySpark-without-PySpark-tp23719.html
>> >> Sent from the Apache Spark User List mailing list archive at
>> Nabble.com.
>> >>
>> >> -
>> >> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> >> For additional commands, e-mail: user-h...@spark.apache.org
>> >>
>> >
>>
>
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

Re: Communication between driver, cluster and HiveServer

2015-07-08 Thread Eric Pederson
A couple of other things.

The Spark Notebook application does have hive-site.xml in its classpath.
It is a copy of the original $SPARK_HOME/conf/hive-site.xml that worked for
spark-shell originally. After the security tweaks were made to
$SPARK_HOME/conf/hive-site.xml, Spark Notebook started working.  But the
same tweaks did *not* need to be applied to the copy that is in the Spark
Notebook's classpath.

I'm running Spark 1.3.1, Hive 0.13.1 and MapR 4.1.0.  The tweaks were
hive.metastore.sasl.enabled=false, hive.server2.authentication=PAM, and
hive.server2.authentication.pam.services=login,sshd,sudo.

Thanks,

-- Eric

On Wed, Jul 8, 2015 at 2:27 PM, Eric Pederson  wrote:

> All:
>
> I recently ran into a scenario where spark-shell could communicate with
> Hive but another application of mine (Spark Notebook) could not.  When I
> tried to get a reference to a table using sql.table("tab") Spark Notebook
> threw an exception but spark-shell did not.
>
> I was trying to track down the difference between the two applications and
> was having a hard time figuring out what it was.
>
> The problem was resolved by tweaking a hive-site.xml security setting,
> but I'm still curious about how it works.
>
> It seems like spark-shell knows how to look at
> $SPARK_HOME/conf/hive-site.xml and communicate with the HiveServer
> directly.  But my other application doesn't know anything about
> hive-site.xml and must communicate with another piece of Spark to get the
> information.  Originally this indirect communication didn't work, but after
> the tweak to hive-site.xml it does.
>
> How does the communication between the driver and Hive work?  And is
> spark-shell somehow special in this regard?
>
> Thanks,
>
> -- Eric
>


Re: RDD saveAsTextFile() to local disk

2015-07-08 Thread canan chen
It works for me by using the following code. Could you share your code?


val data = sc.parallelize(List(1,2,3))
data.saveAsTextFile("file:Users/chen/Temp/c")

On Thu, Jul 9, 2015 at 4:05 AM, spok20nn  wrote:

> Getting exception when wrting RDD to local disk using following function
>
>  saveAsTextFile("file:home/someuser/dir2/testupload/20150708/")
>
> The dir (/home/someuser/dir2/testupload/) was created before running the
> job. The error message is misleading.
>
>
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0
> in
> stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0
> (TID 6, xxx.yyy.com): org.apache.hadoop.fs.ParentNotDirectoryException:
> Parent path is not a directory: file:/home/someuser/dir2
> at
> org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:418)
> at
> org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:426)
> at
> org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:426)
> at
> org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:426)
> at
> org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:426)
> at
> org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:426)
> at
> org.apache.hadoop.fs.ChecksumFileSystem.mkdirs(ChecksumFileSystem.java:588)
> at
> org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:439)
> at
> org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:426)
> at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:906)
> at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:799)
> at
>
> org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputFormat.java:123)
> at
> org.apache.spark.SparkHadoopWriter.open(SparkHadoopWriter.scala:90)
> at
>
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1060)
> at
>
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1051)
> at
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
> at org.apache.spark.scheduler.Task.run(Task.scala:56)
> at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:200)
> at
>
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at
>
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/RDD-saveAsTextFile-to-local-disk-tp23725.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: pause and resume streaming app

2015-07-08 Thread Tathagata Das
Currently the only way to pause it is to stop it. The way I would do this
is to use the Direct Kafka API to access the Kafka offsets, and save them to a
data store as batches finish. If you see a batch job failing because
downstream is down, stop the context. When it comes back up, start a new
streaming context using the saved offsets as starting points.

TD

On Wed, Jul 8, 2015 at 11:52 AM, Shushant Arora 
wrote:

> Is it possible to pause and resume a streaming app?
>
> I have a streaming app which reads events from Kafka and posts to some
> external source. I want to pause the app when the external source is down and
> resume it automatically when it comes back.
>
> Is it possible to pause the app, and is it possible to pause only the
> processing part (let the receiver keep working and bring data into memory+disk),
> and once the external source is up, process the whole data in one go?
>
> Thanks
> Shushant
>
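
A sketch of the pattern TD describes, using the direct Kafka API available
since Spark 1.3. The offset store itself is left abstract: saveOffsets and
loadOffsets are hypothetical functions to be backed by ZooKeeper, a database,
or whatever store fits:

import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils, OffsetRange}

// Hypothetical persistence layer: implement against your own store.
def saveOffsets(ranges: Array[OffsetRange]): Unit = ???
def loadOffsets(): Map[TopicAndPartition, Long] = ???

def startFromSavedOffsets(ssc: StreamingContext, kafkaParams: Map[String, String]): Unit = {
  val fromOffsets = loadOffsets()
  val messageHandler =
    (mmd: MessageAndMetadata[String, String]) => (mmd.key, mmd.message)

  val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder,
    (String, String)](ssc, kafkaParams, fromOffsets, messageHandler)

  stream.foreachRDD { rdd =>
    val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
    // ... push the batch to the external sink; if the sink is down, stop the context ...
    saveOffsets(ranges)  // record progress only after the batch succeeds
  }
}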


What does RDD lineage refer to ?

2015-07-08 Thread canan chen
Lots of places refer to RDD lineage; I'd like to know what it refers to
exactly.  My understanding is that it means the RDD dependencies and the
intermediate MapOutput info in the MapOutputTracker.  Correct me if I am wrong.
Thanks


Re: Remote spark-submit not working with YARN

2015-07-08 Thread Sandy Ryza
Strange.  Does the application show up at all in the YARN web UI?
Does application_1436314873375_0030
show up at all in the YARN ResourceManager logs?

-Sandy

On Wed, Jul 8, 2015 at 3:32 PM, Juan Gordon  wrote:

> Hello Sandy,
>
> Yes, I'm sure that YARN has enough resources; I checked it in the Web
> UI page of my cluster.
>
> Also, I'm able to submit the same script from any of the nodes of the
> cluster.
>
> That's why I don't understand what's happening.
>
> Thanks
>
> JG
>
> On Wed, Jul 8, 2015 at 5:26 PM, Sandy Ryza 
> wrote:
>
>> Hi JG,
>>
>> One way this can occur is that YARN doesn't have enough resources to run
>> your job.  Have you verified that it does?  Are you able to submit using
>> the same command from a node on the cluster?
>>
>> -Sandy
>>
>> On Wed, Jul 8, 2015 at 3:19 PM, jegordon  wrote:
>>
>>> I'm trying to submit a spark job from a different server outside of my
>>> Spark
>>> Cluster (running spark 1.4.0, hadoop 2.4.0 and YARN) using the
>>> spark-submit
>>> script :
>>>
>>> spark/bin/spark-submit --master yarn-client --executor-memory 4G
>>> myjobScript.py
>>>
>>> The thing is that my application never passes beyond the ACCEPTED state; it
>>> is stuck on it:
>>>
>>> 15/07/08 16:49:40 INFO Client: Application report for
>>> application_1436314873375_0030 (state: ACCEPTED)
>>> 15/07/08 16:49:41 INFO Client: Application report for
>>> application_1436314873375_0030 (state: ACCEPTED)
>>> 15/07/08 16:49:42 INFO Client: Application report for
>>> application_1436314873375_0030 (state: ACCEPTED)
>>> 15/07/08 16:49:43 INFO Client: Application report for
>>> application_1436314873375_0030 (state: ACCEPTED)
>>> 15/07/08 16:49:44 INFO Client: Application report for
>>> application_1436314873375_0030 (state: ACCEPTED)
>>> 15/07/08 16:49:45 INFO Client: Application report for
>>> application_1436314873375_0030 (state: ACCEPTED)
>>> 15/07/08 16:49:46 INFO Client: Application report for
>>> application_1436314873375_0030 (state: ACCEPTED)
>>> 15/07/08 16:49:47 INFO Client: Application report for
>>> application_1436314873375_0030 (state: ACCEPTED)
>>> 15/07/08 16:49:48 INFO Client: Application report for
>>> application_1436314873375_0030 (state: ACCEPTED)
>>> 15/07/08 16:49:49 INFO Client: Application report for
>>> application_1436314873375_0030 (state: ACCEPTED)
>>>
>>> But if I execute the same script with spark-submit on the master server
>>> of
>>> my cluster, it runs correctly.
>>>
>>> I already set the YARN configuration on the remote server in
>>> $YARN_CONF_DIR/yarn-site.xml like this:
>>>
>>> <property>
>>>   <name>yarn.resourcemanager.hostname</name>
>>>   <value>54.54.54.54</value>
>>> </property>
>>>
>>> <property>
>>>   <name>yarn.resourcemanager.address</name>
>>>   <value>54.54.54.54:8032</value>
>>>   <description>Enter your ResourceManager hostname.</description>
>>> </property>
>>>
>>> <property>
>>>   <name>yarn.resourcemanager.scheduler.address</name>
>>>   <value>54.54.54.54:8030</value>
>>>   <description>Enter your ResourceManager hostname.</description>
>>> </property>
>>>
>>> <property>
>>>   <name>yarn.resourcemanager.resourcetracker.address</name>
>>>   <value>54.54.54.54:8031</value>
>>>   <description>Enter your ResourceManager hostname.</description>
>>> </property>
>>> Where 54.54.54.54 is the IP of my ResourceManager node.
>>>
>>> Why is this happening? Do I have to configure something else in YARN to
>>> accept remote submits? Or what am I missing?
>>>
>>> Thanks a lot
>>>
>>> JG
>>>
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Remote-spark-submit-not-working-with-YARN-tp23728.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>>>
>>
>
>
> --
> Saludos,
> Juan Gordon
>


Re: foreachRDD vs. forearchPartition ?

2015-07-08 Thread Tathagata Das
This is also discussed in the programming guide.
http://spark.apache.org/docs/latest/streaming-programming-guide.html#design-patterns-for-using-foreachrdd

On Wed, Jul 8, 2015 at 8:25 AM, Dmitry Goldenberg 
wrote:

> Thanks, Sean.
>
> "are you asking about foreach vs foreachPartition? that's quite
> different. foreachPartition does not give more parallelism but lets
> you operate on a whole batch of data at once, which is nice if you
> need to allocate some expensive resource to do the processing"
>
> This is basically what I was looking for.
>
>
> On Wed, Jul 8, 2015 at 11:15 AM, Sean Owen  wrote:
>
>> @Evo There is no foreachRDD operation on RDDs; it is a method of
>> DStream. It gives each RDD in the stream. RDD has a foreach, and
>> foreachPartition. These give elements of an RDD. What do you mean it
>> 'works' to call foreachRDD on an RDD?
>>
>> @Dmitry are you asking about foreach vs foreachPartition? that's quite
>> different. foreachPartition does not give more parallelism but lets
>> you operate on a whole batch of data at once, which is nice if you
>> need to allocate some expensive resource to do the processing.
>>
>> On Wed, Jul 8, 2015 at 3:18 PM, Dmitry Goldenberg
>>  wrote:
>> > "These are quite different operations. One operates on RDDs in  DStream
>> and
>> > one operates on partitions of an RDD. They are not alternatives."
>> >
>> > Sean, different operations as they are, they can certainly be used on
>> the
>> > same data set.  In that sense, they are alternatives. Code can be
>> written
>> > using one or the other which reaches the same effect - likely at a
>> different
>> > efficiency cost.
>> >
>> > The question is, what are the effects of applying one vs. the other?
>> >
>> > My specific scenario is, I'm streaming data out of Kafka.  I want to
>> perform
>> > a few transformations then apply an action which results in e.g. writing
>> > this data to Solr.  According to Evo, my best bet is foreachPartition
>> > because of increased parallelism (which I'd need to grok to understand
>> the
>> > details of what that means).
>> >
>> > Another scenario is, I've done a few transformations and send a result
>> > somewhere, e.g. I write a message into a socket.  Let's say I have one
>> > socket per a client of my streaming app and I get a host:port of that
>> socket
>> > as part of the message and want to send the response via that socket.
>> Is
>> > foreachPartition still a better choice?
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> > On Wed, Jul 8, 2015 at 9:51 AM, Sean Owen  wrote:
>> >>
>> >> These are quite different operations. One operates on RDDs in  DStream
>> and
>> >> one operates on partitions of an RDD. They are not alternatives.
>> >>
>> >>
>> >> On Wed, Jul 8, 2015, 2:43 PM dgoldenberg 
>> wrote:
>> >>>
>> >>> Is there a set of best practices for when to use foreachPartition vs.
>> >>> foreachRDD?
>> >>>
>> >>> Is it generally true that using foreachPartition avoids some of the
>> >>> over-network data shuffling overhead?
>> >>>
>> >>> When would I definitely want to use one method vs. the other?
>> >>>
>> >>> Thanks.
>> >>>
>> >>>
>> >>>
>> >>> --
>> >>> View this message in context:
>> >>>
>> http://apache-spark-user-list.1001560.n3.nabble.com/foreachRDD-vs-forearchPartition-tp23714.html
>> >>> Sent from the Apache Spark User List mailing list archive at
>> Nabble.com.
>> >>>
>> >>> -
>> >>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> >>> For additional commands, e-mail: user-h...@spark.apache.org
>> >>>
>> >
>>
>
>
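
The pattern from the design-patterns section linked above, sketched in Scala
with a placeholder connection type: foreachRDD runs on the driver once per
batch, while foreachPartition lets each executor create one expensive
resource per partition instead of one per record. `Connection` and
`createNewConnection` are hypothetical stand-ins for whatever client is
needed (Solr, a socket, a JDBC pool, ...):

import org.apache.spark.streaming.dstream.DStream

trait Connection { def send(record: String): Unit; def close(): Unit }
def createNewConnection(): Connection = ???

def writeOut(dstream: DStream[String]): Unit = {
  dstream.foreachRDD { rdd =>              // runs on the driver, once per batch
    rdd.foreachPartition { partition =>    // runs on executors, once per partition
      val connection = createNewConnection()  // one connection per partition, not per record
      partition.foreach(record => connection.send(record))
      connection.close()
    }
  }
}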


Re: Problem in Understanding concept of Physical Cores

2015-07-08 Thread Tathagata Das
There are several levels of indirection going on here, let me clarify.

In the local mode, Spark runs tasks (which includes receivers) using the
number of threads defined in the master (either local, or local[2], or
local[*]).
local or local[1] = single thread, so only one task at a time
local[2] = 2 threads, so two tasks
local[*] = as many threads as the number of cores it can detect through the
operating system.


Test 1: When you don't specify a master in spark-submit, it uses local[*]
implicitly, so it uses as many threads as the number of cores that the VM has.
Between 1 and 2 VM cores, the behavior was as expected.
Test 2: When you specified master as local[2], it used two threads.

HTH

TD

On Wed, Jul 8, 2015 at 4:21 AM, Aniruddh Sharma 
wrote:

> Hi
>
> I am new to Spark. Following is the problem that I am facing
>
> Test 1) I ran a VM on a CDH distribution with only 1 core allocated to it,
> and I ran a simple Streaming example in spark-shell, sending data on a
> port and trying to read it. With 1 core allocated to the VM, nothing happens
> in my streaming program and it does not receive data. Now I restart the VM with
> 2 cores allocated to it, start spark-shell again and run the Streaming
> example again, and this time it works.
>
> Query a): From this test I concluded that the Receiver in Streaming will
> occupy the core completely, even though I am using very little data and it
> does not need the complete core for that,
> and it does not give this core to the Executor for computing the
> transformations.  Comparing partition processing and Receiver
> processing: in the case of partitions, the same
> physical cores can process multiple partitions in parallel, but the Receiver
> will not allow its core to process anything else. Is this understanding
> correct?
>
> Test 2) Now I restarted the VM with 1 core again and started spark-shell --master
> local[2]. I have allocated only 1 core to the VM but I tell spark-shell to
> use 2 cores, and I test the streaming program again and it somehow works.
>
> Query b) Now I am more confused, and I don't understand, since I have only 1
> core for the VM. I previously thought it did not work because it had only 1
> core and the Receiver was completely blocking it and not sharing it with the
> Executor. But when I start with local[2], still having only 1 core on the
> VM, it works. So it means that the Receiver and Executor are both getting the same
> physical CPU. Could you explain how it is different in this case and
> what conclusions I should draw about physical CPU usage?
>
> Thanks and Regards
> Aniruddh
>
>
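
A tiny sketch of the master settings TD describes: with a receiver in the
job, at least two threads are needed so that one can run the receiver and
another can run the processing tasks. The application name, host and port
here are arbitrary:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// local[1] would leave no thread for processing once the receiver starts;
// local[2] gives one thread to the receiver and one to the tasks;
// local[*] uses as many threads as cores the JVM can detect.
val conf = new SparkConf().setAppName("streaming-local").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(1))

val lines = ssc.socketTextStream("localhost", 9999)  // the receiver occupies one thread
lines.print()

ssc.start()
ssc.awaitTermination()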


Re: How to change hive database?

2015-07-08 Thread Arun Luthra
Thanks, it works.

On Tue, Jul 7, 2015 at 11:15 AM, Ted Yu  wrote:

> See this thread http://search-hadoop.com/m/q3RTt0NFls1XATV02
>
> Cheers
>
> On Tue, Jul 7, 2015 at 11:07 AM, Arun Luthra 
> wrote:
>
>>
>> https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.hive.HiveContext
>>
>> I'm getting org.apache.spark.sql.catalyst.analysis.NoSuchTableException
>> from:
>>
>> val dataframe = hiveContext.table("other_db.mytable")
>>
>> Do I have to change current database to access it? Is it possible to do
>> this? I'm guessing that the "database.table" syntax that I used in
>> hiveContext.table is not recognized.
>>
>> I have no problems accessing tables in the database called "default".
>>
>> I can list tables in "other_db" with hiveContext.tableNames("other_db")
>>
>> Using Spark 1.4.0.
>>
>>
>>
>
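
The linked thread is not reproduced here, but two approaches that work with
HiveContext are switching the session's current database with a USE
statement, or going through sql() with a qualified table name; a minimal
sketch using the names from the question:

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)  // sc: an existing SparkContext

// Option 1: switch the current database, then use the unqualified name.
hiveContext.sql("USE other_db")
val dataframe = hiveContext.table("mytable")

// Option 2: keep the default database and qualify the table in SQL.
val dataframe2 = hiveContext.sql("SELECT * FROM other_db.mytable")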


Re: FW: MLLIB (Spark) Question.

2015-07-08 Thread DB Tsai
Hi Dhar,

Disabling `standardization` feature is just merged in master.

https://github.com/apache/spark/commit/57221934e0376e5bb8421dc35d4bf91db4deeca1

Let us know your feedback. Thanks.

Sincerely,

DB Tsai
--
Blog: https://www.dbtsai.com
PGP Key ID: 0xAF08DF8D
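
A hedged sketch of how the merged option is expected to be used on the new
spark.ml logistic regression; it assumes a build containing the commit above,
and the `training` DataFrame (with the usual "label"/"features" columns) and
the parameter values are hypothetical:

import org.apache.spark.ml.classification.LogisticRegression

val lr = new LogisticRegression()
  .setStandardization(false)   // keep the features in their original scale
  .setRegParam(0.1)
  .setMaxIter(100)

val model = lr.fit(training)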


On Tue, Jun 16, 2015 at 9:11 PM, Dhar Sauptik (CR/RTC1.3-NA)
 wrote:
> Hi DB,
>
> That will work too. I was just suggesting that as standardization is a simple 
> operation and could have been performed explicitly.
>
> Thank you for the replies.
>
> -Sauptik.
>
> -Original Message-
> From: DB Tsai [mailto:dbt...@dbtsai.com]
> Sent: Tuesday, June 16, 2015 9:04 PM
> To: Dhar Sauptik (CR/RTC1.3-NA)
> Cc: Ramakrishnan Naveen (CR/RTC1.3-NA); user@spark.apache.org
> Subject: Re: FW: MLLIB (Spark) Question.
>
> Hi Dhar,
>
> For "standardization", we can disable it effectively by using
> different regularization on each component. Thus, we're solving the
> same problem but having better rate of convergence. This is one of the
> features I will implement.
>
> Sincerely,
>
> DB Tsai
> --
> Blog: https://www.dbtsai.com
> PGP Key ID: 0xAF08DF8D
>
>
> On Tue, Jun 16, 2015 at 8:34 PM, Dhar Sauptik (CR/RTC1.3-NA)
>  wrote:
>> Hi DB,
>>
>> Thank you for the reply. The answers makes sense. I do have just one more 
>> point to add.
>>
>> Note that it may be better to not implicitly standardize the data. Agreed 
>> that a number of algorithms benefit from such standardization, but for many 
>> applications with contextual information such standardization "may" not be 
>> desirable.
>> Users can always perform the standardization themselves.
>>
>> However, that's just a suggestion. Again, thank you for the clarification.
>>
>> Thanks,
>> Sauptik.
>>
>>
>> -Original Message-
>> From: DB Tsai [mailto:dbt...@dbtsai.com]
>> Sent: Tuesday, June 16, 2015 2:49 PM
>> To: Dhar Sauptik (CR/RTC1.3-NA); Ramakrishnan Naveen (CR/RTC1.3-NA)
>> Cc: user@spark.apache.org
>> Subject: Re: FW: MLLIB (Spark) Question.
>>
>> +cc user@spark.apache.org
>>
>> Reply inline.
>>
>> On Tue, Jun 16, 2015 at 2:31 PM, Dhar Sauptik (CR/RTC1.3-NA)
>>  wrote:
>>> Hi DB,
>>>
>>> Thank you for the reply. That explains a lot.
>>>
>>> I however had a few points regarding this:-
>>>
>>> 1. Just to help with the debate of not regularizing the b parameter. A 
>>> standard implementation argues against regularizing the b parameter. See Pg 
>>> 64 para 1 :  http://statweb.stanford.edu/~tibs/ElemStatLearn/
>>>
>>
>> Agreed. We were just worried that it would change behavior, but we actually
>> have a PR to change the behavior to the standard one,
>> https://github.com/apache/spark/pull/6386
>>
>>> 2. Further, is the regularization of b also applicable for the SGD 
>>> implementation. Currently the SGD vs. BFGS implementations give different 
>>> results (and both the implementations don't match the IRLS algorithm). Are 
>>> the SGD/BFGS implemented for different loss functions? Can you please share 
>>> your thoughts on this.
>>>
>>
>> In SGD implementation, we don't "standardize" the dataset before
>> training. As a result, those columns with low standard deviation will
>> be penalized more, and those with high standard deviation will be
>> penalized less. Also, "standardize" will help the rate of convergence.
>> As a result, most packages "standardize" the data
>> implicitly, and get the weights in the "standardized" space, and
>> transform back to original space so it's transparent for users.
>>
>> 1) LORWithSGD: No standardization, and penalize the intercept.
>> 2) LORWithLBFGS: With standardization but penalize the intercept.
>> 3) New LOR implementation: With standardization without penalizing the
>> intercept.
>>
>> As a result, only the new implementation in Spark ML handles
>> everything correctly. We have tests to verify that the results match
>> R.
>>
>>>
>>> @Naveen: Please feel free to add/comment on the above points as you see 
>>> necessary.
>>>
>>> Thanks,
>>> Sauptik.
>>>
>>> -Original Message-
>>> From: DB Tsai
>>> Sent: Tuesday, June 16, 2015 2:08 PM
>>> To: Ramakrishnan Naveen (CR/RTC1.3-NA)
>>> Cc: Dhar Sauptik (CR/RTC1.3-NA)
>>> Subject: Re: FW: MLLIB (Spark) Question.
>>>
>>> Hey,
>>>
>>> In the LORWithLBFGS api you use, the intercept is regularized while
>>> other implementations don't regularize the intercept. That's why you
>>> see the difference.
>>>
>>> The intercept should not be regularized, so we fix this in new Spark
>>> ML api in spark 1.4. Since this will change the behavior in the old
>>> api if we decide to not regularize the intercept in old version, we
>>> are still debating about this.
>>>
>>> See the following code for full running example in spark 1.4
>>> https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/ml/LogisticRegressionExample.scala
>>>
>>> And also check out my talk at spark summit.
>

Re: Remote spark-submit not working with YARN

2015-07-08 Thread Sandy Ryza
Hi JG,

One way this can occur is that YARN doesn't have enough resources to run
your job.  Have you verified that it does?  Are you able to submit using
the same command from a node on the cluster?

-Sandy
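
A hedged sketch of checks along those lines (standard YARN and Spark 1.4
command-line tools assumed; the application id comes from the log quoted below):

    yarn node -list                                            # NodeManagers up, memory/vcores reported?
    yarn application -status application_1436314873375_0030    # diagnostics for the stuck application
    # a smaller resource ask can rule out a capacity shortfall:
    spark/bin/spark-submit --master yarn-client --num-executors 1 --executor-memory 1G myjobScript.py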

On Wed, Jul 8, 2015 at 3:19 PM, jegordon  wrote:

> I'm trying to submit a spark job from a different server outside of my
> Spark
> Cluster (running spark 1.4.0, hadoop 2.4.0 and YARN) using the spark-submit
> script :
>
> spark/bin/spark-submit --master yarn-client --executor-memory 4G
> myjobScript.py
>
> The thing is that my application never passes from the ACCEPTED state; it
> is stuck there:
>
> 15/07/08 16:49:40 INFO Client: Application report for
> application_1436314873375_0030 (state: ACCEPTED)
> 15/07/08 16:49:41 INFO Client: Application report for
> application_1436314873375_0030 (state: ACCEPTED)
> 15/07/08 16:49:42 INFO Client: Application report for
> application_1436314873375_0030 (state: ACCEPTED)
> 15/07/08 16:49:43 INFO Client: Application report for
> application_1436314873375_0030 (state: ACCEPTED)
> 15/07/08 16:49:44 INFO Client: Application report for
> application_1436314873375_0030 (state: ACCEPTED)
> 15/07/08 16:49:45 INFO Client: Application report for
> application_1436314873375_0030 (state: ACCEPTED)
> 15/07/08 16:49:46 INFO Client: Application report for
> application_1436314873375_0030 (state: ACCEPTED)
> 15/07/08 16:49:47 INFO Client: Application report for
> application_1436314873375_0030 (state: ACCEPTED)
> 15/07/08 16:49:48 INFO Client: Application report for
> application_1436314873375_0030 (state: ACCEPTED)
> 15/07/08 16:49:49 INFO Client: Application report for
> application_1436314873375_0030 (state: ACCEPTED)
>
> But if i execute the same script with spark-submit in the master server of
> my cluster it runs correctly.
>
> I already set the yarn configuration in the remote server in
> $YARN_CONF_DIR/yarn-site.xml like this:
>
>  <property>
>    <name>yarn.resourcemanager.hostname</name>
>    <value>54.54.54.54</value>
>  </property>
>
>  <property>
>    <name>yarn.resourcemanager.address</name>
>    <value>54.54.54.54:8032</value>
>    <description>Enter your ResourceManager hostname.</description>
>  </property>
>
>  <property>
>    <name>yarn.resourcemanager.scheduler.address</name>
>    <value>54.54.54.54:8030</value>
>    <description>Enter your ResourceManager hostname.</description>
>  </property>
>
>  <property>
>    <name>yarn.resourcemanager.resourcetracker.address</name>
>    <value>54.54.54.54:8031</value>
>    <description>Enter your ResourceManager hostname.</description>
>  </property>
> Where 54.54.54.54 is the IP of my resourcemanager node.
>
> Why is this happening? Do I have to configure something else in YARN to
> accept remote submits, or what am I missing?
>
> Thanks a lot
>
> JG
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Remote-spark-submit-not-working-with-YARN-tp23728.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Remote spark-submit not working with YARN

2015-07-08 Thread jegordon
I'm trying to submit a spark job from a different server outside of my Spark
Cluster (running spark 1.4.0, hadoop 2.4.0 and YARN) using the spark-submit
script :

spark/bin/spark-submit --master yarn-client --executor-memory 4G
myjobScript.py

The thing is that my application never passes from the ACCEPTED state; it
is stuck there:

15/07/08 16:49:40 INFO Client: Application report for
application_1436314873375_0030 (state: ACCEPTED)
15/07/08 16:49:41 INFO Client: Application report for
application_1436314873375_0030 (state: ACCEPTED)
15/07/08 16:49:42 INFO Client: Application report for
application_1436314873375_0030 (state: ACCEPTED)
15/07/08 16:49:43 INFO Client: Application report for
application_1436314873375_0030 (state: ACCEPTED)
15/07/08 16:49:44 INFO Client: Application report for
application_1436314873375_0030 (state: ACCEPTED)
15/07/08 16:49:45 INFO Client: Application report for
application_1436314873375_0030 (state: ACCEPTED)
15/07/08 16:49:46 INFO Client: Application report for
application_1436314873375_0030 (state: ACCEPTED)
15/07/08 16:49:47 INFO Client: Application report for
application_1436314873375_0030 (state: ACCEPTED)
15/07/08 16:49:48 INFO Client: Application report for
application_1436314873375_0030 (state: ACCEPTED)
15/07/08 16:49:49 INFO Client: Application report for
application_1436314873375_0030 (state: ACCEPTED)

But if i execute the same script with spark-submit in the master server of
my cluster it runs correctly.

I already set the yarn configuration in the remote server in
$YARN_CONF_DIR/yarn-site.xml like this:

 <property>
   <name>yarn.resourcemanager.hostname</name>
   <value>54.54.54.54</value>
 </property>

 <property>
   <name>yarn.resourcemanager.address</name>
   <value>54.54.54.54:8032</value>
   <description>Enter your ResourceManager hostname.</description>
 </property>

 <property>
   <name>yarn.resourcemanager.scheduler.address</name>
   <value>54.54.54.54:8030</value>
   <description>Enter your ResourceManager hostname.</description>
 </property>

 <property>
   <name>yarn.resourcemanager.resourcetracker.address</name>
   <value>54.54.54.54:8031</value>
   <description>Enter your ResourceManager hostname.</description>
 </property>
Where 54.54.54.54 is the IP of my resourcemanager node.

Why is this happening? Do I have to configure something else in YARN to
accept remote submits, or what am I missing?

Thanks a lot

JG




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Remote-spark-submit-not-working-with-YARN-tp23728.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Best practice for using singletons on workers (seems unanswered) ?

2015-07-08 Thread Dmitry Goldenberg
Richard,
It seems whether I'm doing a foreachRDD or foreachPartition, I'm able to
create per-worker/per-JVM singletons. With 4 workers, I've got 4 singletons
created.  I wouldn't be able to use broadcast vars because the 3rd party
objects are not serializable.

The shuffling effect is basically whenever Spark has to pull data from
multiple machines together over the network when executing an action.
Probably not an issue for foreachRDD, but more for such actions as 'union'
or 'subtract' and the like.
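
A minimal sketch of that per-JVM pattern (ApiClient and send are hypothetical
stand-ins for the non-serializable third-party pieces): the object is created
lazily on the executor and reused by every partition that executor processes,
so nothing non-serializable ever crosses the wire.

    object ClientHolder {
      lazy val client = new ApiClient()        // one instance per executor JVM, built on first use
    }

    dstream.foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        val client = ClientHolder.client       // reused across partitions on this executor
        records.foreach(record => client.send(record))
      }
    }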

On Wed, Jul 8, 2015 at 3:55 PM, Richard Marscher 
wrote:

> Ah, I see this is streaming. I haven't any practical experience with that
> side of Spark. But the foreachPartition idea is a good approach. I've used
> that pattern extensively, even though not for singletons, but just to
> create non-serializable objects like API and DB clients on the executor
> side. I think it's the most straightforward approach to dealing with any
> non-serializable object you need.
>
> I don't entirely follow what over-network data shuffling effects you are
> alluding to (maybe more specific to streaming?).
>
> On Wed, Jul 8, 2015 at 9:41 AM, Dmitry Goldenberg <
> dgoldenberg...@gmail.com> wrote:
>
>> My singletons do in fact stick around. They're one per worker, looks
>> like.  So with 4 workers running on the box, we're creating one singleton
>> per worker process/jvm, which seems OK.
>>
>> Still curious about foreachPartition vs. foreachRDD though...
>>
>> On Tue, Jul 7, 2015 at 11:27 AM, Richard Marscher <
>> rmarsc...@localytics.com> wrote:
>>
>>> Would it be possible to have a wrapper class that just represents a
>>> reference to a singleton holding the 3rd party object? It could proxy over
>>> calls to the singleton object which will instantiate a private instance of
>>> the 3rd party object lazily? I think something like this might work if the
>>> workers have the singleton object in their classpath.
>>>
>>> here's a rough sketch of what I was thinking:
>>>
>>> object ThirdPartySingleton {
>>>   private lazy val thirdPartyObj = ...
>>>
>>>   def someProxyFunction() = thirdPartyObj.()
>>> }
>>>
>>> class ThirdPartyReference extends Serializable {
>>>   def someProxyFunction() = ThirdPartySingleton.someProxyFunction()
>>> }
>>>
>>> also found this SO post:
>>> http://stackoverflow.com/questions/26369916/what-is-the-right-way-to-have-a-static-object-on-all-workers
>>>
>>>
>>> On Tue, Jul 7, 2015 at 11:04 AM, dgoldenberg 
>>> wrote:
>>>
 Hi,

 I am seeing a lot of posts on singletons vs. broadcast variables, such
 as
 *

 http://apache-spark-user-list.1001560.n3.nabble.com/Best-way-to-have-some-singleton-per-worker-tt20277.html
 *

 http://apache-spark-user-list.1001560.n3.nabble.com/How-to-share-a-NonSerializable-variable-among-tasks-in-the-same-worker-node-tt11048.html#a21219

 What's the best approach to instantiate an object once and have it be
 reused
 by the worker(s).

 E.g. I have an object that loads some static state such as e.g. a
 dictionary/map, is a part of 3rd party API and is not serializable.  I
 can't
 seem to get it to be a singleton on the worker side as the JVM appears
 to be
 wiped on every request so I get a new instance.  So the singleton
 doesn't
 stick.

 Is there an approach where I could have this object or a wrapper of it
 be a
 broadcast var? Can Kryo get me there? would that basically mean writing
 a
 custom serializer?  However, the 3rd party object may have a bunch of
 member
 vars hanging off it, so serializing it properly may be non-trivial...

 Any pointers/hints greatly appreciated.



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Best-practice-for-using-singletons-on-workers-seems-unanswered-tp23692.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org


>>>
>>
>


Re: Connecting to nodes on cluster

2015-07-08 Thread Ashish Dutt
The error is JVM has not responded after 10 seconds.
On 08-Jul-2015 10:54 PM, "ayan guha"  wrote:

> What's the error you are getting?
> On 9 Jul 2015 00:01, "Ashish Dutt"  wrote:
>
>> Hi,
>>
>> We have a cluster with 4 nodes. The cluster uses CDH 5.4. For the past two
>> days I have been trying to connect my laptop to the server using Spark,
>> but it has been unsuccessful.
>> The server contains data that needs to be cleaned and analysed.
>> The cluster and the nodes are on a Linux environment.
>> To connect to the nodes I am using SSH.
>>
>> Question: Would it be better if I work directly on the nodes rather than
>> trying to connect my laptop to them ?
>> Question 2: If yes, then can you suggest any python and R IDE that I can
>> install on the nodes to make it work?
>>
>> Thanks for your help
>>
>>
>> Sincerely,
>> Ashish Dutt
>>
>>


Re: Disable heartbeat messages in REPL

2015-07-08 Thread Andrew Or
Hi Lincoln, I've noticed this myself. I believe it's a new issue that only
affects local mode. I've filed a JIRA to track it:
https://issues.apache.org/jira/browse/SPARK-8911

2015-07-08 14:20 GMT-07:00 Lincoln Atkinson :

>  Brilliant! Thanks.
>
>
>
> *From:* Feynman Liang [mailto:fli...@databricks.com]
> *Sent:* Wednesday, July 08, 2015 2:15 PM
> *To:* Lincoln Atkinson
> *Cc:* user@spark.apache.org
> *Subject:* Re: Disable heartbeat messages in REPL
>
>
>
> I was thinking the same thing! Try sc.setLogLevel("ERROR")
>
>
>
> On Wed, Jul 8, 2015 at 2:01 PM, Lincoln Atkinson 
> wrote:
>
>  “WARN Executor: Told to re-register on heartbeat” is logged repeatedly
> in the spark shell, which is very distracting and corrupts the display of
> whatever set of commands I’m currently typing out.
>
>
>
> Is there an option to disable the logging of this message?
>
>
>
> Thanks,
>
> -Lincoln
>
>
>


Re: Real-time data visualization with Zeppelin

2015-07-08 Thread Brandon White
Can you use a cron job to update it every X minutes?

On Wed, Jul 8, 2015 at 2:23 PM, Ganelin, Ilya 
wrote:

> Hi all – I’m just wondering if anyone has had success integrating Spark
> Streaming with Zeppelin and actually dynamically updating the data in near
> real-time. From my investigation, it seems that Zeppelin will only allow
> you to display a snapshot of data, not a continuously updating table. Has
> anyone figured out if there’s a way to loop a display command or how to
> provide a mechanism to continuously update visualizations?
>
> Thank you,
> Ilya Ganelin
>
>
> --
>
> The information contained in this e-mail is confidential and/or
> proprietary to Capital One and/or its affiliates and may only be used
> solely in performance of work or services for Capital One. The information
> transmitted herewith is intended only for use by the individual or entity
> to which it is addressed. If the reader of this message is not the intended
> recipient, you are hereby notified that any review, retransmission,
> dissemination, distribution, copying or other use of, or taking of any
> action in reliance upon this information is strictly prohibited. If you
> have received this communication in error, please contact the sender and
> delete the material from your computer.
>


Real-time data visualization with Zeppelin

2015-07-08 Thread Ganelin, Ilya
Hi all – I’m just wondering if anyone has had success integrating Spark 
Streaming with Zeppelin and actually dynamically updating the data in near 
real-time. From my investigation, it seems that Zeppelin will only allow you to 
display a snapshot of data, not a continuously updating table. Has anyone 
figured out if there’s a way to loop a display command or how to provide a 
mechanism to continuously update visualizations?

Thank you,
Ilya Ganelin



The information contained in this e-mail is confidential and/or proprietary to 
Capital One and/or its affiliates and may only be used solely in performance of 
work or services for Capital One. The information transmitted herewith is 
intended only for use by the individual or entity to which it is addressed. If 
the reader of this message is not the intended recipient, you are hereby 
notified that any review, retransmission, dissemination, distribution, copying 
or other use of, or taking of any action in reliance upon this information is 
strictly prohibited. If you have received this communication in error, please 
contact the sender and delete the material from your computer.


Requirement failed: Some of the DStreams have different slide durations

2015-07-08 Thread anshu shukla
Hi all,

I want to create a union of 2 DStreams. In one of them an RDD is created
every 1 second; the other has RDDs generated by reduceByKeyAndWindow with
the window duration set to 60 sec (slide duration also 60 sec).

- The main idea is to do some analysis on every minute of data and emit the
union of the input data (per sec.) and the transformed data (per min.).


Code is -

JavaPairDStream<String, String> windowedGridCounts =
GridtoPair.reduceByKeyAndWindow(new Function2<String, String, String>() {
@Override public String call(String i1, String i2) {

long id1 = MsgIdAddandRemove.getMessageId(i1);
long id2 = MsgIdAddandRemove.getMessageId(i2);
Float v1 = Float.parseFloat(MsgIdAddandRemove.getMessageContent(i1));
Float v2 = Float.parseFloat(MsgIdAddandRemove.getMessageContent(i2));
String res = String.valueOf(v1 + v2);
if (id1 > id2) {
return MsgIdAddandRemove.addMessageId(res, id1);
} else {
return MsgIdAddandRemove.addMessageId(res, id2);
}
}}, Durations.seconds(60), Durations.seconds(60));

JavaDStream UnionStream = tollPercent.union(underPay).union(taxSumPercent);


Getting the following error -

Exception in thread "main" java.lang.IllegalArgumentException: requirement
failed: Some of the DStreams have different slide durations
at scala.Predef$.require(Predef.scala:233)
at org.apache.spark.streaming.dstream.UnionDStream.<init>(UnionDStream.scala:33)
at org.apache.spark.streaming.dstream.DStream$$anonfun$union$1.apply(DStream.scala:849)
at org.apache.spark.streaming.dstream.DStream$$anonfun$union$1.apply(DStream.scala:849)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:148)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:109)

-- 
Thanks & Regards,
Anshu Shukla


RE: Disable heartbeat messages in REPL

2015-07-08 Thread Lincoln Atkinson
Brilliant! Thanks.

From: Feynman Liang [mailto:fli...@databricks.com]
Sent: Wednesday, July 08, 2015 2:15 PM
To: Lincoln Atkinson
Cc: user@spark.apache.org
Subject: Re: Disable heartbeat messages in REPL

I was thinking the same thing! Try sc.setLogLevel("ERROR")

On Wed, Jul 8, 2015 at 2:01 PM, Lincoln Atkinson 
mailto:lat...@microsoft.com>> wrote:
“WARN Executor: Told to re-register on heartbeat” is logged repeatedly in the 
spark shell, which is very distracting and corrupts the display of whatever set 
of commands I’m currently typing out.

Is there an option to disable the logging of this message?

Thanks,
-Lincoln



Re: Disable heartbeat messages in REPL

2015-07-08 Thread Feynman Liang
I was thinking the same thing! Try sc.setLogLevel("ERROR")
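
A hedged alternative that persists across shell sessions (assuming the stock
log4j setup shipped as conf/log4j.properties.template): raise the threshold of
the logger that emits this particular warning.

    # in conf/log4j.properties -- the message is assumed to come from the Executor logger
    log4j.logger.org.apache.spark.executor.Executor=ERROR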

On Wed, Jul 8, 2015 at 2:01 PM, Lincoln Atkinson 
wrote:

>  “WARN Executor: Told to re-register on heartbeat” is logged repeatedly
> in the spark shell, which is very distracting and corrupts the display of
> whatever set of commands I’m currently typing out.
>
>
>
> Is there an option to disable the logging of this message?
>
>
>
> Thanks,
>
> -Lincoln
>


Disable heartbeat messages in REPL

2015-07-08 Thread Lincoln Atkinson
"WARN Executor: Told to re-register on heartbeat" is logged repeatedly in the 
spark shell, which is very distracting and corrupts the display of whatever set 
of commands I'm currently typing out.

Is there an option to disable the logging of this message?

Thanks,
-Lincoln


Re: PySpark without PySpark

2015-07-08 Thread Sujit Pal
You are welcome, Davies. Just to clarify, I didn't write the post (not sure
if my earlier post gave that impression; apologies if so), although I agree
it's great :-).

-sujit


On Wed, Jul 8, 2015 at 10:36 AM, Davies Liu  wrote:

> Great post, thanks for sharing with us!
>
> On Wed, Jul 8, 2015 at 9:59 AM, Sujit Pal  wrote:
> > Hi Julian,
> >
> > I recently built a Python+Spark application to do search relevance
> > analytics. I use spark-submit to submit PySpark jobs to a Spark cluster
> on
> > EC2 (so I don't use the PySpark shell, hopefully thats what you are
> looking
> > for). Can't share the code, but the basic approach is covered in this
> blog
> > post - scroll down to the section "Writing a Spark Application".
> >
> >
> https://districtdatalabs.silvrback.com/getting-started-with-spark-in-python
> >
> > Hope this helps,
> >
> > -sujit
> >
> >
> > On Wed, Jul 8, 2015 at 7:46 AM, Julian 
> wrote:
> >>
> >> Hey.
> >>
> >> Is there a resource that has written up what the necessary steps are for
> >> running PySpark without using the PySpark shell?
> >>
> >> I can reverse engineer (by following the tracebacks and reading the
> shell
> >> source) what the relevant Java imports needed are, but I would assume
> >> someone has attempted this before and just published something I can
> >> either
> >> follow or install? If not, I have something that pretty much works and
> can
> >> publish it, but I'm not a heavy Spark user, so there may be some things
> >> I've
> >> left out that I haven't hit because of how little of pyspark I'm playing
> >> with.
> >>
> >> Thanks,
> >> Julian
> >>
> >>
> >>
> >> --
> >> View this message in context:
> >>
> http://apache-spark-user-list.1001560.n3.nabble.com/PySpark-without-PySpark-tp23719.html
> >> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> >>
> >> -
> >> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> >> For additional commands, e-mail: user-h...@spark.apache.org
> >>
> >
>


Re: Create RDD from output of unix command

2015-07-08 Thread Richard Marscher
As a distributed data processing engine, Spark should be fine with millions
of lines. It's built with the idea of massive data sets in mind. Do you
have more details on how you anticipate the output of a unix command
interacting with a running Spark application? Do you expect Spark to be
continuously running and somehow observe unix command outputs? Or are you
thinking more along the lines of running a unix command with output and
then taking whatever format that is and running a spark job against it? If
it's the latter, it should be as simple as writing the command output to a
file and then loading the file into an RDD in Spark.
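
A minimal sketch of that latter approach ("mycommand" is a placeholder): capture
the command's stdout to a file, then load the file as an RDD of lines. On a
cluster the file would need to be on HDFS or other shared storage so every
executor can read it.

    import scala.sys.process._

    ("mycommand" #> new java.io.File("/tmp/cmd-output.txt")).!   // run the command, redirect stdout
    val lines = sc.textFile("file:///tmp/cmd-output.txt")         // one RDD element per output line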

On Wed, Jul 8, 2015 at 2:02 PM, foobar  wrote:

> What's the best practice of creating RDD from some external unix command
> output? I assume if the output size is large (say millions of lines),
> creating RDD from an array of all lines is not a good idea? Thanks!
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Create-RDD-from-output-of-unix-command-tp23723.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


-- 
-- 
*Richard Marscher*
Software Engineer
Localytics
Localytics.com  | Our Blog
 | Twitter  |
Facebook  | LinkedIn



RDD saveAsTextFile() to local disk

2015-07-08 Thread spok20nn
Getting an exception when writing an RDD to local disk using the following function

 saveAsTextFile("file:home/someuser/dir2/testupload/20150708/") 

The dir (/home/someuser/dir2/testupload/) was created before running the
job. The error message is misleading. 


org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in
stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0
(TID 6, xxx.yyy.com): org.apache.hadoop.fs.ParentNotDirectoryException:
Parent path is not a directory: file:/home/someuser/dir2
at
org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:418)
at
org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:426)
at
org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:426)
at
org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:426)
at
org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:426)
at
org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:426)
at
org.apache.hadoop.fs.ChecksumFileSystem.mkdirs(ChecksumFileSystem.java:588)
at
org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:439)
at
org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:426)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:906)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:799)
at
org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputFormat.java:123)
at
org.apache.spark.SparkHadoopWriter.open(SparkHadoopWriter.scala:90)
at
org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1060)
at
org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1051)
at
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:200)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
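
A hedged note, not from the thread: with a file: scheme the partitions are
written by the executors, so the parent directories must already exist as
directories on each worker node (not just on the driver), and the fully
qualified local URI is usually written with three slashes:

    rdd.saveAsTextFile("file:///home/someuser/dir2/testupload/20150708/")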




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/RDD-saveAsTextFile-to-local-disk-tp23725.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Best practice for using singletons on workers (seems unanswered) ?

2015-07-08 Thread Richard Marscher
Ah, I see this is streaming. I haven't any practical experience with that
side of Spark. But the foreachPartition idea is a good approach. I've used
that pattern extensively, even though not for singletons, but just to
create non-serializable objects like API and DB clients on the executor
side. I think it's the most straightforward approach to dealing with any
non-serializable object you need.

I don't entirely follow what over-network data shuffling effects you are
alluding to (maybe more specific to streaming?).

On Wed, Jul 8, 2015 at 9:41 AM, Dmitry Goldenberg 
wrote:

> My singletons do in fact stick around. They're one per worker, looks
> like.  So with 4 workers running on the box, we're creating one singleton
> per worker process/jvm, which seems OK.
>
> Still curious about foreachPartition vs. foreachRDD though...
>
> On Tue, Jul 7, 2015 at 11:27 AM, Richard Marscher <
> rmarsc...@localytics.com> wrote:
>
>> Would it be possible to have a wrapper class that just represents a
>> reference to a singleton holding the 3rd party object? It could proxy over
>> calls to the singleton object which will instantiate a private instance of
>> the 3rd party object lazily? I think something like this might work if the
>> workers have the singleton object in their classpath.
>>
>> here's a rough sketch of what I was thinking:
>>
>> object ThirdPartySingleton {
>>   private lazy val thirdPartyObj = ...
>>
>>   def someProxyFunction() = thirdPartyObj.()
>> }
>>
>> class ThirdPartyReference extends Serializable {
>>   def someProxyFunction() = ThirdPartySingleton.someProxyFunction()
>> }
>>
>> also found this SO post:
>> http://stackoverflow.com/questions/26369916/what-is-the-right-way-to-have-a-static-object-on-all-workers
>>
>>
>> On Tue, Jul 7, 2015 at 11:04 AM, dgoldenberg 
>> wrote:
>>
>>> Hi,
>>>
>>> I am seeing a lot of posts on singletons vs. broadcast variables, such as
>>> *
>>>
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Best-way-to-have-some-singleton-per-worker-tt20277.html
>>> *
>>>
>>> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-share-a-NonSerializable-variable-among-tasks-in-the-same-worker-node-tt11048.html#a21219
>>>
>>> What's the best approach to instantiate an object once and have it be
>>> reused
>>> by the worker(s).
>>>
>>> E.g. I have an object that loads some static state such as e.g. a
>>> dictionary/map, is a part of 3rd party API and is not serializable.  I
>>> can't
>>> seem to get it to be a singleton on the worker side as the JVM appears
>>> to be
>>> wiped on every request so I get a new instance.  So the singleton doesn't
>>> stick.
>>>
>>> Is there an approach where I could have this object or a wrapper of it
>>> be a
>>> broadcast var? Can Kryo get me there? would that basically mean writing a
>>> custom serializer?  However, the 3rd party object may have a bunch of
>>> member
>>> vars hanging off it, so serializing it properly may be non-trivial...
>>>
>>> Any pointers/hints greatly appreciated.
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Best-practice-for-using-singletons-on-workers-seems-unanswered-tp23692.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>>>
>>
>


RDD saveAsTextFile() to local disk

2015-07-08 Thread Vijay Pawnarkar
Getting an exception when writing an RDD to local disk using the following function

 saveAsTextFile("file:home/someuser/dir2/testupload/20150708/")

The dir (/home/someuser/dir2/testupload/) was created before running the
job. The error message is misleading.


org.apache.spark.SparkException: Job aborted due to stage failure: Task 0
in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage
0.0 (TID 6, xxx.yyy.com): org.apache.hadoop.fs.ParentNotDirectoryException:
Parent path is not a directory: file:/home/someuser/dir2
at
org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:418)
at
org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:426)
at
org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:426)
at
org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:426)
at
org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:426)
at
org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:426)
at
org.apache.hadoop.fs.ChecksumFileSystem.mkdirs(ChecksumFileSystem.java:588)
at
org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:439)
at
org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:426)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:906)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:799)
at
org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputFormat.java:123)
at
org.apache.spark.SparkHadoopWriter.open(SparkHadoopWriter.scala:90)
at
org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1060)
at
org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:1051)
at
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:200)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

-- 
-Vijay


Job completed successfully without processing anything

2015-07-08 Thread ๏̯͡๏
My job completed in 40 seconds, which is not correct, as there is no output.

I see
Exception in thread "main" akka.actor.ActorNotFound: Actor not found for:
ActorSelection[Anchor(akka.tcp://sparkDriver@10.115.86.24:54737/),
Path(/user/OutputCommitCoordinator)]
at
akka.actor.ActorSelection$$anonfun$resolveOne$1.apply(ActorSelection.scala:65)
at
akka.actor.ActorSelection$$anonfun$resolveOne$1.apply(ActorSelection.scala:63)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
at
akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.processBatch$1(BatchingExecutor.scala:67)
at
akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:82)
at
akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:59)
at
akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:59)
at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72)
at akka.dispatch.BatchingExecutor$Batch.run(BatchingExecutor.scala:58)
at
akka.dispatch.ExecutionContexts$sameThreadExecutionContext$.unbatchedExecute(Future.scala:74)
at akka.dispatch.BatchingExecutor$class.execute(BatchingExecutor.scala:110)
at
akka.dispatch.ExecutionContexts$sameThreadExecutionContext$.execute(Future.scala:73)
at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40)
at
scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248)
at akka.pattern.PromiseActorRef.$bang(AskSupport.scala:267)
at akka.actor.EmptyLocalActorRef.specialHandle(ActorRef.scala:508)
at akka.actor.DeadLetterActorRef.specialHandle(ActorRef.scala:541)
at akka.actor.DeadLetterActorRef.$bang(ActorRef.scala:531)
at
akka.remote.RemoteActorRefProvider$RemoteDeadLetterActorRef.$bang(RemoteActorRefProvider.scala:87)
at akka.remote.EndpointManager$$anonfun$1.applyOrElse(Remoting.scala:575)
at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
at akka.remote.EndpointManager.aroundReceive(Remoting.scala:395)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
at akka.actor.ActorCell.invoke(ActorCell.scala:487)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
at akka.dispatch.Mailbox.run(Mailbox.scala:220)
at
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
15/07/08 12:27:31 INFO storage.DiskBlockManager: Shutdown hook called
15/07/08 12:27:31 INFO util.Utils: Shutdown hook called


-- 
Deepak


pause and resume streaming app

2015-07-08 Thread Shushant Arora
Is it possible to pause and resume a streaming app?

I have a streaming app which reads events from Kafka and posts them to some
external source. I want to pause the app when the external source is down and
resume it automatically when it comes back.

Is it possible to pause the app? And is it possible to pause only the
processing part, letting the receiver keep working and bringing data into
memory+disk, and then, once the external source is up, process the whole
backlog in one go?

Thanks
Shushant


Re: spark benchmarking

2015-07-08 Thread Stephen Boesch
One  option is the databricks/spark-perf project
https://github.com/databricks/spark-perf

2015-07-08 11:23 GMT-07:00 MrAsanjar . :

> Hi all,
> What is the most common used tool/product to benchmark spark job?
>


Communication between driver, cluster and HiveServer

2015-07-08 Thread Eric Pederson
All:

I recently ran into a scenario where spark-shell could communicate with
Hive but another application of mine (Spark Notebook) could not.  When I
tried to get a reference to a table using sql.table("tab") Spark Notebook
threw an exception but spark-shell did not.

I was trying to track down the difference between the two applications and
was having a hard time figuring out what it was.

The problem was resolved by tweaking a hive-site.xml security setting, but
I'm still curious about how it works.

It seems like spark-shell knows how to look at
$SPARK_HOME/conf/hive-site.xml and communicate with the HiveServer
directly.  But my other application doesn't know anything about
hive-site.xml and must communicate with another piece of Spark to get the
information.  Originally this indirect communication didn't work, but after
the tweak to hive-site.xml it does.

How does the communication between the driver and Hive work?  And is
spark-shell somehow special in this regard?

Thanks,

-- Eric


spark benchmarking

2015-07-08 Thread MrAsanjar .
Hi all,
What is the most common used tool/product to benchmark spark job?


Re: spark core/streaming doubts

2015-07-08 Thread Tathagata Das
Responses inline.

On Wed, Jul 8, 2015 at 10:26 AM, Shushant Arora 
wrote:

> 1. Is creating a read-only singleton object in each map function the same as
> using a broadcast object, given that a singleton never gets garbage collected
> unless the executor gets shut down? The aim is to avoid creating a complex
> object at each batch interval of a Spark Streaming app.
>
> No, objects created in a map function are transient objects in the
executor, which get GCed as long as you don't set up permanent references
to those objects (through singletons and statics) that prevent GC.


>
> 2. Why is JavaStreamingContext's sc() method deprecated? What is the other
> way to access the Spark context to broadcast a variable then, i.e.
> jssc.sc().broadcast(filter)?
>
> jssc.sparkContext()


> 3. In a streaming app, do the processing executors (executors other than
> receivers) stay up 24*7 while the streaming app is alive?
> And are tasks allocated as threads on these executors?
>
> Executors stay up as long as the SparkContext is not stopped. This is true
for any Spark application, not just Spark Streaming applications.
But executors can fail and can get restarted, so it's not correct to rely on
24/7 availability. You have to plan for faults and recovery if you have
to do some custom stateful stuff.



> Thanks
> Shushant
>
>
>


Re: SnappyCompressionCodec on the master

2015-07-08 Thread Josh Rosen
Can you file a JIRA?  https://issues.apache.org/jira/browse/SPARK

On Wed, Jul 8, 2015 at 12:47 AM, nizang  wrote:

> hi,
>
> I'm running spark standalone cluster (1.4.0). I have some applications
> running with scheduler every hour. I found that on one of the executions,
> the job got to be FINISHED after very few seconds (instead of ~5 minutes),
> and in the logs on the master, I can see the following exception:
>
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 1
> in
> stage 1.0 failed 4 times, most recent failure: Lost task 1.3 in stage 1.0
> (TID 20, 172.31.6.203): java.io.IOException:
> java.lang.reflect.InvocationTargetException
> at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1257)
> at
>
> org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:165)
> at
>
> org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:64)
> at
>
> org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:64)
> at
>
> org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:88)
> at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
> at
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:59)
> at org.apache.spark.scheduler.Task.run(Task.scala:70)
> at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
> at
>
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at
>
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.reflect.InvocationTargetException
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
> Method)
> at
>
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
> at
>
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
> at
>
> org.apache.spark.io.CompressionCodec$.createCodec(CompressionCodec.scala:68)
> at
>
> org.apache.spark.io.CompressionCodec$.createCodec(CompressionCodec.scala:60)
> at
> org.apache.spark.broadcast.TorrentBroadcast.org
> $apache$spark$broadcast$TorrentBroadcast$$setConf(TorrentBroadcast.scala:73)
> at
>
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:167)
> at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1254)
> ... 11 more
> Caused by: java.lang.IllegalArgumentException
> at
>
> org.apache.spark.io.SnappyCompressionCodec.(CompressionCodec.scala:152)
> ... 20 more
>
> Driver stacktrace:
> at
> org.apache.spark.scheduler.DAGScheduler.org
> $apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1266)
> at
>
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1257)
> at
>
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1256)
> at
>
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at
> scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1256)
> at
>
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
> at
>
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
> at scala.Option.foreach(Option.scala:236)
> at
>
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:730)
> at
>
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1450)
> at
>
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1411)
> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
>
> This job was successful many times before and after this run, and other
> jobs
> were successful in that time
>
> Any idea what can cause that?
>
> thanks, nizan
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/SnappyCompressionCodec-on-the-master-tp23711.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: [SPARK-SQL] libgplcompression.so already loaded in another classloader

2015-07-08 Thread Michael Armbrust
Here's a related JIRA: https://issues.apache.org/jira/browse/SPARK-7819


Typically you can work around this by making sure that the classes are
shared across the isolation boundary, as discussed in the comments.
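
A hedged sketch of that workaround in spark-defaults.conf (assuming the GPL/LZO
compression classes are what trigger the native load; keep the usual JDBC
defaults when overriding, since setting the property replaces them):

    spark.sql.hive.metastore.sharedPrefixes  com.hadoop.compression.lzo,com.mysql.jdbc,org.postgresql,oracle.jdbc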

On Tue, Jul 7, 2015 at 3:29 AM, Sea <261810...@qq.com> wrote:

> Hi, all
> I found an Exception when using spark-sql
> java.lang.UnsatisfiedLinkError: Native Library
> /data/lib/native/libgplcompression.so already loaded in another classloader
> ...
>
> I set  spark.sql.hive.metastore.jars=.  in file spark-defaults.conf
>
> It does not happen every time. Who knows why?
>
> Spark version: 1.4.0
> Hadoop version: 2.2.0
>
>


Create RDD from output of unix command

2015-07-08 Thread foobar
What's the best practice of creating RDD from some external unix command
output? I assume if the output size is large (say millions of lines),
creating RDD from an array of all lines is not a good idea? Thanks! 



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Create-RDD-from-output-of-unix-command-tp23723.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Streaming checkpoints and logic change

2015-07-08 Thread Tathagata Das
Hey Jong,

No, I did answer the right question. What I explained does not change the JVM
classes (that is, the function stays the same) but it still ensures that the
computation is different (the filters get updated with time). So you can
checkpoint this and recover from it. This is ONE possible way to do
dynamically changing logic within the constraints of checkpointing. Since
there was no further detail provided on the kind of "dynamicity" in logic you
are interested in, I gave ONE possible way to do it. But I agree that I
should have tied up the loose ends by confirming that this works with
checkpointing, as the JVM class representing the function does not need to
change for the logic to change.

Now let's address your case of a log transformer which extracts fields of logs
from a Kafka stream. Could you provide some pseudocode for the kind of change
you would want to see? I want to learn more about what kind of dynamicity
people want. I am aware of this limitation and want to address it in the
future, but for that I need to understand the requirements.

BTW, your workaround is a pretty good workaround.

TD


On Wed, Jul 8, 2015 at 10:38 AM, Jong Wook Kim  wrote:

> Hi TD, you answered a wrong question. If you read the subject, mine was
> specifically about checkpointing. I'll elaborate
>
> The checkpoint, which is a serialized DStream DAG, contains all the
> metadata and *logic*, like the function passed to e.g. DStream.transform()
>
> This is serialized as an anonymous inner class at the JVM level, and will
> not tolerate the slightest logic change, because the class signature will
> change and the checkpoint, which contains the serialized form from the
> previous version, can no longer be deserialized.
>
> Logic changes are extremely common in stream processing. Say I have a log
> transformer which extracts certain fields of logs from a Kafka stream and I
> want to add another field to extract. This involves DStream logic changes,
> so it cannot be done using a checkpoint; I can't even achieve an
> at-least-once guarantee.
>
> My current workaround is to read current offsets by casting to
> HasOffsetRanges
> 
>  and
> saving them to ZooKeeper, and give fromOffsets parameter read from
> ZooKeeper when creating a directStream. I've settled down to this approach
> for now, but I want to know how makers of Spark Streaming think about this
> drawback of checkpointing.
>
> If anyone had similar experience, suggestions will be appreciated.
>
> Jong Wook
>
>
>
> On 9 July 2015 at 02:13, Tathagata Das  wrote:
>
>> You can use DStream.transform for some stuff. Transform takes a RDD =>
>> RDD function that allow arbitrary RDD operations to be done on RDDs of a
>> DStream. This function gets evaluated on the driver on every batch
>> interval. If you are smart about writing the function, it can do different
>> stuff at different intervals. For example, you can always use a
>> continuously updated set of filters
>>
>> dstream.transform { rdd =>
>>val broadcastedFilters = Filters.getLatest()
>>val newRDD  = rdd.filter { x => broadcastedFilters.get.filter(x) }
>>newRDD
>> }
>>
>>
>> The function Filters.getLatest() will return the latest set of filters
>> that is broadcasted out, and as the transform function is processed in
>> every batch interval, it will always use the latest filters.
>>
>> HTH.
>>
>> TD
>>
>> On Wed, Jul 8, 2015 at 10:02 AM, Jong Wook Kim  wrote:
>>
>>> I just asked this question at the streaming webinar that just ended, but
>>> the speakers didn't answered so throwing here:
>>>
>>> AFAIK checkpoints are the only recommended method for running Spark
>>> streaming without data loss. But it involves serializing the entire dstream
>>> graph, which prohibits any logic changes. How should I update / fix logic
>>> of a running streaming app without any data loss?
>>>
>>> Jong Wook
>>>
>>
>>
>


Re: Reading Avro files from Streaming

2015-07-08 Thread harris
Resolved that compilation issue using AvroKey and AvroKeyInputFormat.

 val avroDs = ssc.fileStream[AvroKey[GenericRecord], NullWritable,
  AvroKeyInputFormat[GenericRecord]](input)
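
For completeness, the imports the snippet above relies on (package locations per
the Avro mapred/mapreduce split plus Hadoop's NullWritable):

    import org.apache.avro.generic.GenericRecord
    import org.apache.avro.mapred.AvroKey
    import org.apache.avro.mapreduce.AvroKeyInputFormat
    import org.apache.hadoop.io.NullWritable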



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Reading-Avro-files-from-Streaming-tp23709p23722.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Streaming checkpoints and logic change

2015-07-08 Thread Jong Wook Kim
Hi TD, you answered a wrong question. If you read the subject, mine was
specifically about checkpointing. I'll elaborate

The checkpoint, which is a serialized DStream DAG, contains all the
metadata and *logic*, like the function passed to e.g. DStream.transform()

This is serialized as an anonymous inner class at the JVM level, and will
not tolerate the slightest logic change, because the class signature will
change and the checkpoint, which contains the serialized form from the
previous version, can no longer be deserialized.

Logic changes are extremely common in stream processing. Say I have a log
transformer which extracts certain fields of logs from a Kafka stream and I
want to add another field to extract. This involves DStream logic changes,
so it cannot be done using a checkpoint; I can't even achieve an
at-least-once guarantee.

My current workaround is to read current offsets by casting to
HasOffsetRanges

and
saving them to ZooKeeper, and give fromOffsets parameter read from
ZooKeeper when creating a directStream. I've settled down to this approach
for now, but I want to know how makers of Spark Streaming think about this
drawback of checkpointing.
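
A rough sketch of that workaround (saveOffsets and readOffsets are hypothetical
helpers backed by ZooKeeper or any durable store; ssc and kafkaParams are assumed
to be set up as usual for the direct Kafka API):

    import kafka.common.TopicAndPartition
    import kafka.message.MessageAndMetadata
    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils}

    // On (re)start, rebuild the starting offsets from the external store.
    val fromOffsets: Map[TopicAndPartition, Long] = readOffsets()

    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, (String, String)](
      ssc, kafkaParams, fromOffsets,
      (mmd: MessageAndMetadata[String, String]) => (mmd.key, mmd.message))

    stream.foreachRDD { rdd =>
      val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      // ... process rdd ...
      saveOffsets(ranges)   // persist topic/partition/untilOffset after processing succeeds
    }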

If anyone had similar experience, suggestions will be appreciated.

Jong Wook



On 9 July 2015 at 02:13, Tathagata Das  wrote:

> You can use DStream.transform for some stuff. Transform takes a RDD => RDD
> function that allow arbitrary RDD operations to be done on RDDs of a
> DStream. This function gets evaluated on the driver on every batch
> interval. If you are smart about writing the function, it can do different
> stuff at different intervals. For example, you can always use a
> continuously updated set of filters
>
> dstream.transform { rdd =>
>val broadcastedFilters = Filters.getLatest()
>val newRDD  = rdd.filter { x => broadcastedFilters.get.filter(x) }
>newRDD
> }
>
>
> The function Filters.getLatest() will return the latest set of filters
> that is broadcasted out, and as the transform function is processed in
> every batch interval, it will always use the latest filters.
>
> HTH.
>
> TD
>
> On Wed, Jul 8, 2015 at 10:02 AM, Jong Wook Kim  wrote:
>
>> I just asked this question at the streaming webinar that just ended, but
>> the speakers didn't answered so throwing here:
>>
>> AFAIK checkpoints are the only recommended method for running Spark
>> streaming without data loss. But it involves serializing the entire dstream
>> graph, which prohibits any logic changes. How should I update / fix logic
>> of a running streaming app without any data loss?
>>
>> Jong Wook
>>
>
>


Re: PySpark without PySpark

2015-07-08 Thread Davies Liu
Great post, thanks for sharing with us!

On Wed, Jul 8, 2015 at 9:59 AM, Sujit Pal  wrote:
> Hi Julian,
>
> I recently built a Python+Spark application to do search relevance
> analytics. I use spark-submit to submit PySpark jobs to a Spark cluster on
> EC2 (so I don't use the PySpark shell, hopefully thats what you are looking
> for). Can't share the code, but the basic approach is covered in this blog
> post - scroll down to the section "Writing a Spark Application".
>
> https://districtdatalabs.silvrback.com/getting-started-with-spark-in-python
>
> Hope this helps,
>
> -sujit
>
>
> On Wed, Jul 8, 2015 at 7:46 AM, Julian  wrote:
>>
>> Hey.
>>
>> Is there a resource that has written up what the necessary steps are for
>> running PySpark without using the PySpark shell?
>>
>> I can reverse engineer (by following the tracebacks and reading the shell
>> source) what the relevant Java imports needed are, but I would assume
>> someone has attempted this before and just published something I can
>> either
>> follow or install? If not, I have something that pretty much works and can
>> publish it, but I'm not a heavy Spark user, so there may be some things
>> I've
>> left out that I haven't hit because of how little of pyspark I'm playing
>> with.
>>
>> Thanks,
>> Julian
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/PySpark-without-PySpark-tp23719.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



spark core/streaming doubts

2015-07-08 Thread Shushant Arora
1. Is creating a read-only singleton object in each map function the same as
using a broadcast object, given that a singleton never gets garbage collected
unless the executor gets shut down? The aim is to avoid creating a complex
object at each batch interval of a Spark Streaming app.


2. Why is JavaStreamingContext's sc() method deprecated? What is the other
way to access the Spark context to broadcast a variable then, i.e.
jssc.sc().broadcast(filter)?

3. In a streaming app, do the processing executors (executors other than
receivers) stay up 24*7 while the streaming app is alive?
And are tasks allocated as threads on these executors?

Thanks
Shushant


Re: Getting started with spark-scala developemnt in eclipse.

2015-07-08 Thread Feynman Liang
Take a look at

https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools#UsefulDeveloperTools-Eclipse

On Wed, Jul 8, 2015 at 7:47 AM, Daniel Siegmann  wrote:

> To set up Eclipse for Spark you should install the Scala IDE plugins:
> http://scala-ide.org/download/current.html
>
> Define your project in Maven with Scala plugins configured (you should be
> able to find documentation online) and import as an existing Maven project.
> The source code should be in src/main/scala but otherwise the project
> structure will be the same as you'd expect in Java.
>
> Nothing special is needed for Spark. Just define the desired Spark jars (
> spark-core and possibly others, such as spark-sql) in your Maven POM as
> dependencies. You should scope these dependencies as provided, since they
> will automatically be on the classpath when you deploy your project to a
> Spark cluster.
>
> One thing to keep in mind is that Scala dependencies require separate jars
> for different versions of Scala, and it is convention to append the Scala
> version to the artifact ID. For example, if you are using Scala 2.11.x,
> your dependency will be spark-core_2.11 (look on search.maven.org if
> you're not sure). I think you can omit the Scala version if you're using
> SBT (not sure why you would, but some people seem to prefer it).
>
> Unit testing Spark is briefly explained in the programming guide
> 
> .
>
> To deploy using spark-submit you can build the jar using mvn package if
> and only if you don't have any non-Spark dependencies. Otherwise, the
> simplest thing is to build a jar with dependencies (typically using the
> assembly
> 
> or shade  plugins).
>
>
>
>
>
> On Wed, Jul 8, 2015 at 9:38 AM, Prateek .  wrote:
>
>>  Hi
>>
>>
>>
>> I am beginner to scala and spark. I am trying to set up eclipse
>> environment to develop spark program  in scala, then take it’s  jar  for
>> spark-submit.
>>
>> How shall I start? To start my  task includes, setting up eclipse for
>> scala and spark, getting dependencies resolved, building project using
>> maven/sbt.
>>
>> Is there any good blog or documentation that is can follow.
>>
>>
>>
>> Thanks
>>
>>  "DISCLAIMER: This message is proprietary to Aricent and is intended
>> solely for the use of the individual to whom it is addressed. It may
>> contain privileged or confidential information and should not be circulated
>> or used for any purpose other than for what it is intended. If you have
>> received this message in error, please notify the originator immediately.
>> If you are not the intended recipient, you are notified that you are
>> strictly prohibited from using, copying, altering, or disclosing the
>> contents of this message. Aricent accepts no responsibility for loss or
>> damage arising from the use of the information transmitted by this email
>> including damage from virus."
>>
>
>


Re: how to use DoubleRDDFunctions on mllib Vector?

2015-07-08 Thread Feynman Liang
An RDD[Double] is an abstraction for a large collection of doubles, possibly
distributed across multiple nodes. The DoubleRDDFunctions are there for
performing mean and variance calculations across this distributed dataset.

In contrast, a Vector is not distributed and fits on your local machine.
You would be better off computing these quantities on the Vector directly
(see mllib.clustering.GaussianMixture#vectorMean for an example of how to
compute the mean of a vector).
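
A minimal sketch of both routes in the spark-shell (sc is the shell's
SparkContext; the example values are made up):

import org.apache.spark.mllib.linalg.Vectors

val v = Vectors.dense(1.0, 2.0, 3.0, 4.0)

// Local route: the vector fits in memory, so compute directly on its values.
val values = v.toArray
val mean = values.sum / values.length
val variance = values.map(x => (x - mean) * (x - mean)).sum / values.length

// Distributed route (normally unnecessary for a single local vector):
// parallelize the values into an RDD[Double] and use DoubleRDDFunctions.
val rdd = sc.parallelize(values)
println(s"mean=${rdd.mean()} variance=${rdd.variance()}")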

On Tue, Jul 7, 2015 at 8:26 PM, 诺铁  wrote:

> hi,
>
> there are some useful functions in DoubleRDDFunctions, which I can use if
> I have RDD[Double], eg, mean, variance.
>
> Vector doesn't have such methods, how can I convert Vector to RDD[Double],
> or maybe better if I can call mean directly on a Vector?
>


Re: Streaming checkpoints and logic change

2015-07-08 Thread Tathagata Das
You can use DStream.transform for some stuff. Transform takes an RDD => RDD
function that allows arbitrary RDD operations to be done on RDDs of a
DStream. This function gets evaluated on the driver on every batch
interval. If you are smart about writing the function, it can do different
stuff at different intervals. For example, you can always use a
continuously updated set of filters

dstream.transform { rdd =>
   val broadcastedFilters = Filters.getLatest()
   val newRDD  = rdd.filter { x => broadcastedFilters.get.filter(x) }
   newRDD
}


The function Filters.getLatest() will return the latest set of filters that
is broadcasted out, and as the transform function is processed in every
batch interval, it will always use the latest filters.

HTH.

TD

On Wed, Jul 8, 2015 at 10:02 AM, Jong Wook Kim  wrote:

> I just asked this question at the streaming webinar that just ended, but
> the speakers didn't answered so throwing here:
>
> AFAIK checkpoints are the only recommended method for running Spark
> streaming without data loss. But it involves serializing the entire dstream
> graph, which prohibits any logic changes. How should I update / fix logic
> of a running streaming app without any data loss?
>
> Jong Wook
>


Streaming checkpoints and logic change

2015-07-08 Thread Jong Wook Kim
I just asked this question at the streaming webinar that just ended, but
the speakers didn't answer, so I'm throwing it out here:

AFAIK checkpoints are the only recommended method for running Spark
streaming without data loss. But it involves serializing the entire dstream
graph, which prohibits any logic changes. How should I update / fix logic
of a running streaming app without any data loss?

Jong Wook


Re: PySpark without PySpark

2015-07-08 Thread Sujit Pal
Hi Julian,

I recently built a Python+Spark application to do search relevance
analytics. I use spark-submit to submit PySpark jobs to a Spark cluster on
EC2 (so I don't use the PySpark shell, hopefully that's what you are looking
for). Can't share the code, but the basic approach is covered in this blog
post - scroll down to the section "Writing a Spark Application".

https://districtdatalabs.silvrback.com/getting-started-with-spark-in-python

Hope this helps,

-sujit


On Wed, Jul 8, 2015 at 7:46 AM, Julian  wrote:

> Hey.
>
> Is there a resource that has written up what the necessary steps are for
> running PySpark without using the PySpark shell?
>
> I can reverse engineer (by following the tracebacks and reading the shell
> source) what the relevant Java imports needed are, but I would assume
> someone has attempted this before and just published something I can either
> follow or install? If not, I have something that pretty much works and can
> publish it, but I'm not a heavy Spark user, so there may be some things
> I've
> left out that I haven't hit because of how little of pyspark I'm playing
> with.
>
> Thanks,
> Julian
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/PySpark-without-PySpark-tp23719.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Restarting Spark Streaming Application with new code

2015-07-08 Thread Vinoth Chandar
Thanks for the clarification, Cody!

On Mon, Jul 6, 2015 at 6:44 AM, Cody Koeninger  wrote:

> You shouldn't rely on being able to restart from a checkpoint after
> changing code, regardless of whether the change was explicitly related to
> serialization.
>
> If you are relying on checkpoints to hold state, specifically which
> offsets have been processed, that state will be lost if you can't recover
> from the checkpoint.  After restart the stream will start receiving
> messages based on the auto.offset.reset setting, either the beginning or
> the end of the kafka retention.
>
> To avoid this, save state in your own data store.
>
> On Sat, Jul 4, 2015 at 9:01 PM, Vinoth Chandar  wrote:
>
>> Hi,
>>
>> Just looking for some clarity on the below 1.4 documentation.
>>
>> >>And restarting from earlier checkpoint information of pre-upgrade code
>> cannot be done. The checkpoint information essentially contains serialized
>> Scala/Java/Python objects and trying to deserialize objects with new,
>> modified classes may lead to errors.
>>
>> Does this mean, new code cannot be deployed over the same checkpoints
>> even if there are not any serialization related changes? (in other words,
>> if the new code does not break previous checkpoint code w.r.t
>> serialization, would new deploys work?)
>>
>>
>> >>In this case, either start the upgraded app with a different checkpoint
>> directory, or delete the previous checkpoint directory.
>>
>> Assuming this applies to metadata & data checkpointing, does it mean that
>> effectively all the computed 'state' is gone? If I am reading from Kafka,
>> does the new code start receiving messages from where it left off?
>>
>> Thanks
>> Vinoth
>>
>
>
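
A minimal Scala sketch of the "save state in your own data store" approach for
the Kafka direct stream; ssc is an existing StreamingContext, and process() /
saveOffsets() are hypothetical stand-ins for your own processing and offset store:

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils, OffsetRange}

val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
val topics = Set("events")

def process(msg: String): Unit = println(msg)            // hypothetical processing
def saveOffsets(ranges: Array[OffsetRange]): Unit =      // hypothetical durable store
  ranges.foreach(r => println(s"${r.topic}/${r.partition} -> ${r.untilOffset}"))

val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, topics)

stream.foreachRDD { rdd =>
  // Capture the exact offsets this batch covers, before any transformation.
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

  rdd.map(_._2).foreachPartition(records => records.foreach(process))

  // Persist the offsets yourself so a redeployed app can resume from them.
  saveOffsets(offsetRanges)
}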


Re: (de)serialize DStream

2015-07-08 Thread Shixiong Zhu
A DStream must be Serializable because of metadata checkpointing. But you can use
KryoSerializer for data checkpointing: data checkpointing goes through
RDD.checkpoint, whose serialization is controlled by spark.serializer.

Best Regards,
Shixiong Zhu
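
A minimal sketch of switching data serialization to Kryo, assuming hypothetical
state key/value classes; note this only affects data (RDD) checkpointing - the
DStream graph metadata is still Java-serialized:

import org.apache.spark.SparkConf

// Hypothetical key/value types held in the updateStateByKey state stream.
case class StateKey(id: String)
case class StateValue(count: Long)

val conf = new SparkConf()
  .setAppName("kryo-checkpoint-sketch")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // Optional, but avoids writing full class names into the serialized data.
  .registerKryoClasses(Array(classOf[StateKey], classOf[StateValue]))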

2015-07-08 3:43 GMT+08:00 Chen Song :

> In Spark Streaming, when using updateStateByKey, it requires the generated
> DStream to be checkpointed.
>
> It seems that it always use JavaSerializer, no matter what I set for
> spark.serializer. Can I use KryoSerializer for checkpointing? If not, I
> assume the key and value types have to be Serializable?
>
> Chen
>


Re: Parallelizing multiple RDD / DataFrame creation in Spark

2015-07-08 Thread ayan guha
Do you have a benchmark showing that running these two statements as-is will
be slower than what you suggest?
On 9 Jul 2015 01:06, "Brandon White"  wrote:

> The point of running them in parallel would be faster creation of the
> tables. Has anybody been able to efficiently parallelize something like
> this in Spark?
> On Jul 8, 2015 12:29 AM, "Akhil Das"  wrote:
>
>> Whats the point of creating them in parallel? You can multi-thread it run
>> it in parallel though.
>>
>> Thanks
>> Best Regards
>>
>> On Wed, Jul 8, 2015 at 5:34 AM, Brandon White 
>> wrote:
>>
>>> Say I have a spark job that looks like following:
>>>
>>> def loadTable1() {
>>>   val table1 = sqlContext.jsonFile(s"s3://textfiledirectory/")
>>>   table1.cache().registerTempTable("table1")
>>> }
>>> def loadTable2() {
>>>   val table2 = sqlContext.jsonFile(s"s3://testfiledirectory2/")
>>>   table2.cache().registerTempTable("table2")
>>> }
>>>
>>> def loadAllTables() {
>>>   loadTable1()
>>>   loadTable2()
>>> }
>>>
>>> loadAllTables()
>>>
>>> How do I parallelize this Spark job so that both tables are created at
>>> the same time or in parallel?
>>>
>>
>>


Re: Kryo Serializer on Worker doesn't work by default.

2015-07-08 Thread Eugene Morozov
What I don't seem to get is how my code ends up on the Worker node.

My understanding was that the jar file I use to start the job should
automatically be copied to the Worker nodes and added to the classpath. That seems
not to be the case. But if my jar is not copied to the Worker nodes, then how is my
code able to run on the Workers? I know that the Driver serializes my functions and
sends them to the Workers, but most of my functions use some other classes
(from the jar). It seems that along with the function code, the Driver has to be able
to copy the other classes too. Is that so?

That might actually explain why the KryoRegistrator is not found on the Worker -
there are no functions which use it directly, so it is never copied to the Workers.

Could you please explain how code ends up on the Worker, or give me a hint
where I can find it in the sources?

On 08 Jul 2015, at 17:40, Eugene Morozov  wrote:

> Hello.
> 
> I have an issue with CustomKryoRegistrator, which causes ClassNotFound on 
> Worker. 
> The issue is resolved if call SparkConf.setJar with path to the same jar I 
> run.
> 
> It is a workaround, but it requires to specify the same jar file twice. The 
> first time I use it to actually run the job, and second time in properties 
> file, which looks weird and unclear to as why I should do that. 
> 
> What is the reason for it? I thought the jar file has to be copied into all 
> Worker nodes (or else it's not possible to run the job on Workers). Can 
> anyone shed some light on this?
> 
> Thanks
> --
> Eugene Morozov
> fathers...@list.ru
> 
> 
> 
> 

Eugene Morozov
fathers...@list.ru






Re: foreachRDD vs. forearchPartition ?

2015-07-08 Thread Dmitry Goldenberg
Thanks, Sean.

"are you asking about foreach vs foreachPartition? that's quite
different. foreachPartition does not give more parallelism but lets
you operate on a whole batch of data at once, which is nice if you
need to allocate some expensive resource to do the processing"

This is basically what I was looking for.


On Wed, Jul 8, 2015 at 11:15 AM, Sean Owen  wrote:

> @Evo There is no foreachRDD operation on RDDs; it is a method of
> DStream. It gives each RDD in the stream. RDD has a foreach, and
> foreachPartition. These give elements of an RDD. What do you mean it
> 'works' to call foreachRDD on an RDD?
>
> @Dmitry are you asking about foreach vs foreachPartition? that's quite
> different. foreachPartition does not give more parallelism but lets
> you operate on a whole batch of data at once, which is nice if you
> need to allocate some expensive resource to do the processing.
>
> On Wed, Jul 8, 2015 at 3:18 PM, Dmitry Goldenberg
>  wrote:
> > "These are quite different operations. One operates on RDDs in  DStream
> and
> > one operates on partitions of an RDD. They are not alternatives."
> >
> > Sean, different operations as they are, they can certainly be used on the
> > same data set.  In that sense, they are alternatives. Code can be written
> > using one or the other which reaches the same effect - likely at a
> different
> > efficiency cost.
> >
> > The question is, what are the effects of applying one vs. the other?
> >
> > My specific scenario is, I'm streaming data out of Kafka.  I want to
> perform
> > a few transformations then apply an action which results in e.g. writing
> > this data to Solr.  According to Evo, my best bet is foreachPartition
> > because of increased parallelism (which I'd need to grok to understand
> the
> > details of what that means).
> >
> > Another scenario is, I've done a few transformations and send a result
> > somewhere, e.g. I write a message into a socket.  Let's say I have one
> > socket per a client of my streaming app and I get a host:port of that
> socket
> > as part of the message and want to send the response via that socket.  Is
> > foreachPartition still a better choice?
> >
> >
> >
> >
> >
> >
> >
> >
> > On Wed, Jul 8, 2015 at 9:51 AM, Sean Owen  wrote:
> >>
> >> These are quite different operations. One operates on RDDs in  DStream
> and
> >> one operates on partitions of an RDD. They are not alternatives.
> >>
> >>
> >> On Wed, Jul 8, 2015, 2:43 PM dgoldenberg 
> wrote:
> >>>
> >>> Is there a set of best practices for when to use foreachPartition vs.
> >>> foreachRDD?
> >>>
> >>> Is it generally true that using foreachPartition avoids some of the
> >>> over-network data shuffling overhead?
> >>>
> >>> When would I definitely want to use one method vs. the other?
> >>>
> >>> Thanks.
> >>>
> >>>
> >>>
> >>> --
> >>> View this message in context:
> >>>
> http://apache-spark-user-list.1001560.n3.nabble.com/foreachRDD-vs-forearchPartition-tp23714.html
> >>> Sent from the Apache Spark User List mailing list archive at
> Nabble.com.
> >>>
> >>> -
> >>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> >>> For additional commands, e-mail: user-h...@spark.apache.org
> >>>
> >
>


Re: foreachRDD vs. forearchPartition ?

2015-07-08 Thread Dmitry Goldenberg
Thanks, Cody. The "good boy" comment wasn't from me :)  I was the one
asking for help.



On Wed, Jul 8, 2015 at 10:52 AM, Cody Koeninger  wrote:

> Sean already answered your question.  foreachRDD and foreachPartition are
> completely different, there's nothing fuzzy or insufficient about that
> answer.  The fact that you can call foreachPartition on an rdd within the
> scope of foreachRDD should tell you that they aren't in any way comparable.
>
> I'm not sure if your rudeness ("be a good boy"...really?) is intentional
> or not.  If you're asking for help from people that are in most cases
> donating their time, I'd suggest that you'll have more success with a
> little more politeness.
>
> On Wed, Jul 8, 2015 at 9:05 AM, Evo Eftimov  wrote:
>
>> That was a) fuzzy b) insufficient – one can certainly use forach (only)
>> on DStream RDDs – it works as empirical observation
>>
>>
>>
>> As another empirical observation:
>>
>>
>>
>> For each partition results in having one instance of the lambda/closure
>> per partition when e.g. publishing to output systems like message brokers,
>> databases and file systems - that increases the level of parallelism of
>> your output processing
>>
>>
>>
>> As an architect I deal with gazillions of products and don’t have time to
>> read the source code of all of them to make up for documentation
>> deficiencies. On the other hand I believe you have been involved in writing
>> some of the code so be a good boy and either answer this question properly
>> or enhance the product documentation of that area of the system
>>
>>
>>
>> *From:* Sean Owen [mailto:so...@cloudera.com]
>> *Sent:* Wednesday, July 8, 2015 2:52 PM
>> *To:* dgoldenberg; user@spark.apache.org
>> *Subject:* Re: foreachRDD vs. forearchPartition ?
>>
>>
>>
>> These are quite different operations. One operates on RDDs in  DStream
>> and one operates on partitions of an RDD. They are not alternatives.
>>
>>
>>
>> On Wed, Jul 8, 2015, 2:43 PM dgoldenberg 
>> wrote:
>>
>> Is there a set of best practices for when to use foreachPartition vs.
>> foreachRDD?
>>
>> Is it generally true that using foreachPartition avoids some of the
>> over-network data shuffling overhead?
>>
>> When would I definitely want to use one method vs. the other?
>>
>> Thanks.
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/foreachRDD-vs-forearchPartition-tp23714.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>


Re: [SparkR] Float type coercion with hiveContext

2015-07-08 Thread Evgeny Sinelnikov
Thank you, Ray,

but it is already created and almost fixed:
https://issues.apache.org/jira/browse/SPARK-8840

On Wed, Jul 8, 2015 at 4:04 PM, Sun, Rui  wrote:

> Hi, Evgeny,
>
> I reported a JIRA issue for your problem:
> https://issues.apache.org/jira/browse/SPARK-8897. You can track it to see
> how it will be solved.
>
> Ray
>
> -Original Message-
> From: Evgeny Sinelnikov [mailto:esinelni...@griddynamics.com]
> Sent: Monday, July 6, 2015 7:27 PM
> To: huangzheng
> Cc: Apache Spark User List
> Subject: Re: [SparkR] Float type coercion with hiveContext
>
> I used spark 1.4.0 binaries from official site:
> http://spark.apache.org/downloads.html
>
> And running it on:
> * Hortonworks HDP 2.2.0.0-2041
> * with Hive 0.14
> * with disabled hooks for Application Timeline Servers (ATSHook) in
> hive-site.xml (commented hive.exec.failure.hooks, hive.exec.post.hooks,
> hive.exec.pre.hooks)
>
>
> On Mon, Jul 6, 2015 at 1:33 PM, huangzheng <1106944...@qq.com> wrote:
> >
> > Hi , Are you used sparkR about spark 1.4 version? How do build  from
> > spark source code ?
> >
> > -- Original message --
> > From: "Evgeny Sinelnikov";;
> > Sent: Monday, July 6, 2015, 6:31 PM
> > To: "user";
> > Subject: [SparkR] Float type coercion with hiveContext
> >
> > Hello,
> >
> > I've run into a problem with float type coercion on SparkR with hiveContext.
> >
> >> result <- sql(hiveContext, "SELECT offset, percentage from data limit
> >> 100")
> >
> >> show(result)
> > DataFrame[offset:float, percentage:float]
> >
> >> head(result)
> > Error in as.data.frame.default(x[[i]], optional = TRUE) :
> > cannot coerce class ""jobj"" to a data.frame
> >
> >
> > This trouble looks like already exists (SPARK-2863 - Emulate Hive type
> > coercion in native reimplementations of Hive functions) with same
> > reason - not completed "native reimplementations of Hive..." not
> > "...functions" only.
> >
> > It looks like a bug.
> > So, anybody met this issue before?
> >
> > -
> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For
> > additional commands, e-mail: user-h...@spark.apache.org
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional
> commands, e-mail: user-h...@spark.apache.org
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Jobs with unknown origin.

2015-07-08 Thread Jan-Paul Bultmann
Hey,

I have quite a few jobs appearing in the web-ui with the description "run at 
ThreadPoolExecutor.java:1142".
Are these generated by SparkSQL internally?

There are so many that they cause a RejectedExecutionException when the 
thread-pool runs out of space for them.

RejectedExecutionException Task 
scala.concurrent.impl.Future$PromiseCompletingRunnable@30ec07a4 rejected from 
java.util.concurrent.ThreadPoolExecutor@9110d1[Running, pool size = 128, active 
threads = 128, queued tasks = 0, completed tasks = 392]  
java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution 
(ThreadPoolExecutor.java:2047)

Any ideas where they come from? I'm pretty sure that they don't originate from 
my code.

Cheers Jan
-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Parallelizing multiple RDD / DataFrame creation in Spark

2015-07-08 Thread Srikanth
Your loadTable() functions do not trigger actions. The files will only be read
when an action is performed.
If the action is something like table1.join(table2), then I think both
files will be read in parallel.
Can you try that and look at the execution plan? In 1.4 this is shown in the
Spark UI.

Srikanth
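
If the goal is to force both reads to start at the same time, a minimal sketch of
the multi-threading approach mentioned earlier in the thread, using Scala Futures;
the count() actions are only there to force materialization, and sqlContext and
the paths are taken as given from the original question:

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

// 'sqlContext' is assumed to already exist (e.g. the spark-shell's SQLContext).
def loadTable(path: String, name: String): Future[Long] = Future {
  val df = sqlContext.jsonFile(path)   // lazy: nothing is read yet
  df.cache().registerTempTable(name)
  df.count()                           // action: forces the read and fills the cache
}

val loads = Seq(
  loadTable("s3://textfiledirectory/", "table1"),
  loadTable("s3://testfiledirectory2/", "table2")
)
Await.result(Future.sequence(loads), 1.hour)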

On Wed, Jul 8, 2015 at 11:06 AM, Brandon White 
wrote:

> The point of running them in parallel would be faster creation of the
> tables. Has anybody been able to efficiently parallelize something like
> this in Spark?
> On Jul 8, 2015 12:29 AM, "Akhil Das"  wrote:
>
>> Whats the point of creating them in parallel? You can multi-thread it run
>> it in parallel though.
>>
>> Thanks
>> Best Regards
>>
>> On Wed, Jul 8, 2015 at 5:34 AM, Brandon White 
>> wrote:
>>
>>> Say I have a spark job that looks like following:
>>>
>>> def loadTable1() {
>>>   val table1 = sqlContext.jsonFile(s"s3://textfiledirectory/")
>>>   table1.cache().registerTempTable("table1")
>>> }
>>> def loadTable2() {
>>>   val table2 = sqlContext.jsonFile(s"s3://testfiledirectory2/")
>>>   table2.cache().registerTempTable("table2")
>>> }
>>>
>>> def loadAllTables() {
>>>   loadTable1()
>>>   loadTable2()
>>> }
>>>
>>> loadAllTables()
>>>
>>> How do I parallelize this Spark job so that both tables are created at
>>> the same time or in parallel?
>>>
>>
>>


Re: foreachRDD vs. forearchPartition ?

2015-07-08 Thread Sean Owen
@Evo There is no foreachRDD operation on RDDs; it is a method of
DStream. It gives each RDD in the stream. RDD has a foreach, and
foreachPartition. These give elements of an RDD. What do you mean it
'works' to call foreachRDD on an RDD?

@Dmitry are you asking about foreach vs foreachPartition? that's quite
different. foreachPartition does not give more parallelism but lets
you operate on a whole batch of data at once, which is nice if you
need to allocate some expensive resource to do the processing.
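
A minimal spark-shell sketch of that difference, using a hypothetical Connection
class as the "expensive resource":

// Hypothetical stand-in for an expensive external resource (e.g. a DB or Solr client).
class Connection {
  def send(record: String): Unit = println(record)
  def close(): Unit = ()
}
def createConnection(): Connection = new Connection

val rdd = sc.parallelize(Seq("a", "b", "c"))

// Per element: a new connection for every record - simple but expensive.
rdd.foreach { record =>
  val conn = createConnection()
  conn.send(record)
  conn.close()
}

// Per partition: one connection reused for the whole batch of records in that partition.
rdd.foreachPartition { records =>
  val conn = createConnection()
  records.foreach(conn.send)
  conn.close()
}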

On Wed, Jul 8, 2015 at 3:18 PM, Dmitry Goldenberg
 wrote:
> "These are quite different operations. One operates on RDDs in  DStream and
> one operates on partitions of an RDD. They are not alternatives."
>
> Sean, different operations as they are, they can certainly be used on the
> same data set.  In that sense, they are alternatives. Code can be written
> using one or the other which reaches the same effect - likely at a different
> efficiency cost.
>
> The question is, what are the effects of applying one vs. the other?
>
> My specific scenario is, I'm streaming data out of Kafka.  I want to perform
> a few transformations then apply an action which results in e.g. writing
> this data to Solr.  According to Evo, my best bet is foreachPartition
> because of increased parallelism (which I'd need to grok to understand the
> details of what that means).
>
> Another scenario is, I've done a few transformations and send a result
> somewhere, e.g. I write a message into a socket.  Let's say I have one
> socket per a client of my streaming app and I get a host:port of that socket
> as part of the message and want to send the response via that socket.  Is
> foreachPartition still a better choice?
>
>
>
>
>
>
>
>
> On Wed, Jul 8, 2015 at 9:51 AM, Sean Owen  wrote:
>>
>> These are quite different operations. One operates on RDDs in  DStream and
>> one operates on partitions of an RDD. They are not alternatives.
>>
>>
>> On Wed, Jul 8, 2015, 2:43 PM dgoldenberg  wrote:
>>>
>>> Is there a set of best practices for when to use foreachPartition vs.
>>> foreachRDD?
>>>
>>> Is it generally true that using foreachPartition avoids some of the
>>> over-network data shuffling overhead?
>>>
>>> When would I definitely want to use one method vs. the other?
>>>
>>> Thanks.
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/foreachRDD-vs-forearchPartition-tp23714.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Parallelizing multiple RDD / DataFrame creation in Spark

2015-07-08 Thread Brandon White
The point of running them in parallel would be faster creation of the
tables. Has anybody been able to efficiently parallelize something like
this in Spark?
On Jul 8, 2015 12:29 AM, "Akhil Das"  wrote:

> Whats the point of creating them in parallel? You can multi-thread it run
> it in parallel though.
>
> Thanks
> Best Regards
>
> On Wed, Jul 8, 2015 at 5:34 AM, Brandon White 
> wrote:
>
>> Say I have a spark job that looks like following:
>>
>> def loadTable1() {
>>   val table1 = sqlContext.jsonFile(s"s3://textfiledirectory/")
>>   table1.cache().registerTempTable("table1")
>> }
>> def loadTable2() {
>>   val table2 = sqlContext.jsonFile(s"s3://testfiledirectory2/")
>>   table2.cache().registerTempTable("table2")
>> }
>>
>> def loadAllTables() {
>>   loadTable1()
>>   loadTable2()
>> }
>>
>> loadAllTables()
>>
>> How do I parallelize this Spark job so that both tables are created at
>> the same time or in parallel?
>>
>
>


Re: Connecting to nodes on cluster

2015-07-08 Thread ayan guha
What's the error you are getting?
On 9 Jul 2015 00:01, "Ashish Dutt"  wrote:

> Hi,
>
> We have a cluster with 4 nodes. The cluster uses CDH 5.4 for the past two
> days I have been trying to connect my laptop to the server using spark
>  but it has been unsuccessful.
> The server contains data that needs to be cleaned and analysed.
> The cluster and the nodes are on linux environment.
> To connect to the nodes I am using SSH
>
> Question: Would it be better if I work directly on the nodes rather than
> trying to connect my laptop to them ?
> Question 2: If yes, then can you suggest any python and R IDE that I can
> install on the nodes to make it work?
>
> Thanks for your help
>
>
> Sincerely,
> Ashish Dutt
>
>


Re: foreachRDD vs. forearchPartition ?

2015-07-08 Thread Cody Koeninger
Sean already answered your question.  foreachRDD and foreachPartition are
completely different, there's nothing fuzzy or insufficient about that
answer.  The fact that you can call foreachPartition on an rdd within the
scope of foreachRDD should tell you that they aren't in any way comparable.

I'm not sure if your rudeness ("be a good boy"...really?) is intentional or
not.  If you're asking for help from people that are in most cases donating
their time, I'd suggest that you'll have more success with a little more
politeness.

On Wed, Jul 8, 2015 at 9:05 AM, Evo Eftimov  wrote:

> That was a) fuzzy b) insufficient – one can certainly use forach (only) on
> DStream RDDs – it works as empirical observation
>
>
>
> As another empirical observation:
>
>
>
> For each partition results in having one instance of the lambda/closure
> per partition when e.g. publishing to output systems like message brokers,
> databases and file systems - that increases the level of parallelism of
> your output processing
>
>
>
> As an architect I deal with gazillions of products and don’t have time to
> read the source code of all of them to make up for documentation
> deficiencies. On the other hand I believe you have been involved in writing
> some of the code so be a good boy and either answer this question properly
> or enhance the product documentation of that area of the system
>
>
>
> *From:* Sean Owen [mailto:so...@cloudera.com]
> *Sent:* Wednesday, July 8, 2015 2:52 PM
> *To:* dgoldenberg; user@spark.apache.org
> *Subject:* Re: foreachRDD vs. forearchPartition ?
>
>
>
> These are quite different operations. One operates on RDDs in  DStream and
> one operates on partitions of an RDD. They are not alternatives.
>
>
>
> On Wed, Jul 8, 2015, 2:43 PM dgoldenberg  wrote:
>
> Is there a set of best practices for when to use foreachPartition vs.
> foreachRDD?
>
> Is it generally true that using foreachPartition avoids some of the
> over-network data shuffling overhead?
>
> When would I definitely want to use one method vs. the other?
>
> Thanks.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/foreachRDD-vs-forearchPartition-tp23714.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Is there a way to shutdown the derby in hive context in spark shell?

2015-07-08 Thread Terry Hole
I am using spark 1.4.1rc1 with default hive settings

Thanks
- Terry

Hi All,

I'd like to use the hive context in the spark shell. I need to recreate the
hive meta database in the same location, so I want to close the derby
connection previously created in the spark shell. Is there any way to do this?

I tried this, but it does not work:

DriverManager.getConnection("jdbc:derby:;shutdown=true");

Thanks!

- Terry


Re: Getting started with spark-scala developemnt in eclipse.

2015-07-08 Thread Daniel Siegmann
To set up Eclipse for Spark you should install the Scala IDE plugins:
http://scala-ide.org/download/current.html

Define your project in Maven with Scala plugins configured (you should be
able to find documentation online) and import as an existing Maven project.
The source code should be in src/main/scala but otherwise the project
structure will be the same as you'd expect in Java.

Nothing special is needed for Spark. Just define the desired Spark jars (
spark-core and possibly others, such as spark-sql) in your Maven POM as
dependencies. You should scope these dependencies as provided, since they
will automatically be on the classpath when you deploy your project to a
Spark cluster.

One thing to keep in mind is that Scala dependencies require separate jars
for different versions of Scala, and it is convention to append the Scala
version to the artifact ID. For example, if you are using Scala 2.11.x,
your dependency will be spark-core_2.11 (look on search.maven.org if you're
not sure). I think you can omit the Scala version if you're using SBT (not
sure why you would, but some people seem to prefer it).

Unit testing Spark is briefly explained in the programming guide.

To deploy using spark-submit you can build the jar using mvn package if and
only if you don't have any non-Spark dependencies. Otherwise, the simplest
thing is to build a jar with dependencies (typically using the assembly or
shade plugins).
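
For the sbt route mentioned above, a roughly equivalent build.sbt sketch (the
versions are examples only and should match your cluster):

name := "my-spark-job"

version := "0.1.0"

scalaVersion := "2.10.4"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.4.0" % "provided",
  "org.apache.spark" %% "spark-sql"  % "1.4.0" % "provided"
)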





On Wed, Jul 8, 2015 at 9:38 AM, Prateek .  wrote:

>  Hi
>
>
>
> I am beginner to scala and spark. I am trying to set up eclipse
> environment to develop spark program  in scala, then take it’s  jar  for
> spark-submit.
>
> How shall I start? To start my  task includes, setting up eclipse for
> scala and spark, getting dependencies resolved, building project using
> maven/sbt.
>
> Is there any good blog or documentation that is can follow.
>
>
>
> Thanks
>
>  "DISCLAIMER: This message is proprietary to Aricent and is intended
> solely for the use of the individual to whom it is addressed. It may
> contain privileged or confidential information and should not be circulated
> or used for any purpose other than for what it is intended. If you have
> received this message in error, please notify the originator immediately.
> If you are not the intended recipient, you are notified that you are
> strictly prohibited from using, copying, altering, or disclosing the
> contents of this message. Aricent accepts no responsibility for loss or
> damage arising from the use of the information transmitted by this email
> including damage from virus."
>


PySpark without PySpark

2015-07-08 Thread Julian
Hey.

Is there a resource that has written up what the necessary steps are for
running PySpark without using the PySpark shell?

I can reverse engineer (by following the tracebacks and reading the shell
source) what the relevant Java imports needed are, but I would assume
someone has attempted this before and just published something I can either
follow or install? If not, I have something that pretty much works and can
publish it, but I'm not a heavy Spark user, so there may be some things I've
left out that I haven't hit because of how little of pyspark I'm playing
with.

Thanks,
Julian



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/PySpark-without-PySpark-tp23719.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Kryo Serializer on Worker doesn't work by default.

2015-07-08 Thread Eugene Morozov
Hello.

I have an issue with a CustomKryoRegistrator, which causes a ClassNotFound on the
Worker.
The issue is resolved if I call SparkConf.setJars with the path to the same jar I run.

It is a workaround, but it requires specifying the same jar file twice: the
first time to actually run the job, and the second time in the properties
file, which looks weird, and it is unclear to me why I should have to do that.

What is the reason for it? I thought the jar file has to be copied to all
Worker nodes (or else it's not possible to run the job on Workers). Can anyone
shed some light on this?
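
For reference, a minimal sketch of the workaround described above; the
registrator class name and the jar path are placeholders:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("kryo-registrator-sketch")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", "com.example.CustomKryoRegistrator") // placeholder class name
  .setJars(Seq("/path/to/my-app.jar"))  // the same jar that is submitted to run the job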

Thanks
--
Eugene Morozov
fathers...@list.ru






Re: foreachRDD vs. forearchPartition ?

2015-07-08 Thread Dmitry Goldenberg
"These are quite different operations. One operates on RDDs in  DStream and
one operates on partitions of an RDD. They are not alternatives."

Sean, different operations as they are, they can certainly be used on the
same data set.  In that sense, they are alternatives. Code can be written
using one or the other which reaches the same effect - likely at a
different efficiency cost.

The question is, what are the effects of applying one vs. the other?

My specific scenario is, I'm streaming data out of Kafka.  I want to
perform a few transformations then apply an action which results in e.g.
writing this data to Solr.  According to Evo, my best bet is
foreachPartition because of increased parallelism (which I'd need to grok
to understand the details of what that means).

Another scenario is, I've done a few transformations and send a result
somewhere, e.g. I write a message into a socket.  Let's say I have one
socket per a client of my streaming app and I get a host:port of that
socket as part of the message and want to send the response via that
socket.  Is foreachPartition still a better choice?








On Wed, Jul 8, 2015 at 9:51 AM, Sean Owen  wrote:

> These are quite different operations. One operates on RDDs in  DStream and
> one operates on partitions of an RDD. They are not alternatives.
>
> On Wed, Jul 8, 2015, 2:43 PM dgoldenberg  wrote:
>
>> Is there a set of best practices for when to use foreachPartition vs.
>> foreachRDD?
>>
>> Is it generally true that using foreachPartition avoids some of the
>> over-network data shuffling overhead?
>>
>> When would I definitely want to use one method vs. the other?
>>
>> Thanks.
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/foreachRDD-vs-forearchPartition-tp23714.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>


Re: PySpark MLlib: py4j cannot find trainImplicitALSModel method

2015-07-08 Thread Ashish Dutt
Hello Sooraj,
Thank you for your response. It indeed gives me a ray of hope now.
Can you please suggest any good tutorials for installing and working with
the IPython notebook server on the node?
Thank you
Ashish
On 08-Jul-2015 6:16 PM, "sooraj"  wrote:
>
> Hi Ashish,
>
> I am running ipython notebook server on one of the nodes of the cluster
(HDP). Setting it up was quite straightforward, and I guess I followed the
same references that you linked to. Then I access the notebook remotely
from my development PC. Never tried to connect a local ipython (on a PC) to
a remote Spark cluster. Not sure if that is possible.
>
> - Sooraj
>
> On 8 July 2015 at 15:31, Ashish Dutt  wrote:
>>
>> My apologies for double posting but I missed the web links that i
followed which are 1, 2, 3
>>
>> Thanks,
>> Ashish
>>
>> On Wed, Jul 8, 2015 at 5:49 PM, sooraj  wrote:
>>>
>>> That turned out to be a silly data type mistake. At one point in the
iterative call, I was passing an integer value for the parameter 'alpha' of
the ALS train API, which was expecting a Double. So, py4j in fact
complained that it cannot take a method that takes an integer value for
that parameter.
>>>
>>> On 8 July 2015 at 12:35, sooraj  wrote:

 Hi,

 I am using MLlib collaborative filtering API on an implicit preference
data set. From a pySpark notebook, I am iteratively creating the matrix
factorization model with the aim of measuring the RMSE for each combination
of parameters for this API like the rank, lambda and alpha. After the code
successfully completed six iterations, on the seventh call of the
ALS.trainImplicit API, I get a confusing exception that says py4j cannot
find the method trainImplicitALSmodel.  The full trace is included at the
end of the email.

 I am running Spark over YARN (yarn-client mode) with five executors.
This error seems to be happening completely on the driver as I don't see
any error on the Spark web interface. I have tried changing the
spark.yarn.am.memory configuration value, but it doesn't help. Any
suggestion on how to debug this will be very helpful.

 Thank you,
 Sooraj

 Here is the full error trace:


---
 Py4JError Traceback (most recent call
last)
  in ()
   3
   4 for index, (r, l, a, i) in enumerate(itertools.product(ranks,
lambdas, alphas, iters)):
 > 5 model = ALS.trainImplicit(scoreTableTrain, rank = r,
iterations = i, lambda_ = l, alpha = a)
   6
   7 predictionsTrain = model.predictAll(userProductTrainRDD)


/usr/local/spark-1.4/spark-1.4.0-bin-hadoop2.6/python/pyspark/mllib/recommendation.pyc
in trainImplicit(cls, ratings, rank, iterations, lambda_, blocks, alpha,
nonnegative, seed)
 198   nonnegative=False, seed=None):
 199 model = callMLlibFunc("trainImplicitALSModel",
cls._prepare(ratings), rank,
 --> 200   iterations, lambda_, blocks,
alpha, nonnegative, seed)
 201 return MatrixFactorizationModel(model)
 202


/usr/local/spark-1.4/spark-1.4.0-bin-hadoop2.6/python/pyspark/mllib/common.pyc
in callMLlibFunc(name, *args)
 126 sc = SparkContext._active_spark_context
 127 api = getattr(sc._jvm.PythonMLLibAPI(), name)
 --> 128 return callJavaFunc(sc, api, *args)
 129
 130


/usr/local/spark-1.4/spark-1.4.0-bin-hadoop2.6/python/pyspark/mllib/common.pyc
in callJavaFunc(sc, func, *args)
 119 """ Call Java Function """
 120 args = [_py2java(sc, a) for a in args]
 --> 121 return _java2py(sc, func(*args))
 122
 123

 /usr/local/lib/python2.7/site-packages/py4j/java_gateway.pyc in
__call__(self, *args)
 536 answer = self.gateway_client.send_command(command)
 537 return_value = get_return_value(answer,
self.gateway_client,
 --> 538 self.target_id, self.name)
 539
 540 for temp_arg in temp_args:

 /usr/local/lib/python2.7/site-packages/py4j/protocol.pyc in
get_return_value(answer, gateway_client, target_id, name)
 302 raise Py4JError(
 303 'An error occurred while calling
{0}{1}{2}. Trace:\n{3}\n'.
 --> 304 format(target_id, '.', name, value))
 305 else:
 306 raise Py4JError(

 Py4JError: An error occurred while calling o667.trainImplicitALSModel.
Trace:
 py4j.Py4JException: Method trainImplicitALSModel([class
org.apache.spark.api.java.JavaRDD, class java.lang.Integer, class
java.lang.Integer, class java.lang.Integer, class java.lang.Integer, class
java.lang.Double, class java.lang.Boolean, null]) does not exist
 at
py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:

RE: foreachRDD vs. forearchPartition ?

2015-07-08 Thread Evo Eftimov
That was a) fuzzy b) insufficient – one can certainly use foreach (only) on 
DStream RDDs – it works, as an empirical observation.  

 

As another empirical observation:

 

foreachPartition results in having one instance of the lambda/closure per 
partition when, e.g., publishing to output systems like message brokers, 
databases and file systems - that increases the level of parallelism of your 
output processing 

 

As an architect I deal with gazillions of products and don’t have time to read 
the source code of all of them to make up for documentation deficiencies. On 
the other hand I believe you have been involved in writing some of the code so 
be a good boy and either answer this question properly or enhance the product 
documentation of that area of the system 

 

From: Sean Owen [mailto:so...@cloudera.com] 
Sent: Wednesday, July 8, 2015 2:52 PM
To: dgoldenberg; user@spark.apache.org
Subject: Re: foreachRDD vs. forearchPartition ?

 

These are quite different operations. One operates on RDDs in  DStream and one 
operates on partitions of an RDD. They are not alternatives. 

 

On Wed, Jul 8, 2015, 2:43 PM dgoldenberg  wrote:

Is there a set of best practices for when to use foreachPartition vs.
foreachRDD?

Is it generally true that using foreachPartition avoids some of the
over-network data shuffling overhead?

When would I definitely want to use one method vs. the other?

Thanks.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/foreachRDD-vs-forearchPartition-tp23714.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Connecting to nodes on cluster

2015-07-08 Thread Ashish Dutt
Hi,

We have a cluster with 4 nodes. The cluster uses CDH 5.4. For the past two
days I have been trying to connect my laptop to the server using Spark,
but it has been unsuccessful.
The server contains data that needs to be cleaned and analysed.
The cluster and the nodes are on a Linux environment.
To connect to the nodes I am using SSH.

Question: Would it be better if I work directly on the nodes rather than
trying to connect my laptop to them?
Question 2: If yes, then can you suggest any python and R IDE that I can
install on the nodes to make it work?

Thanks for your help


Sincerely,
Ashish Dutt


Re: foreachRDD vs. forearchPartition ?

2015-07-08 Thread Sean Owen
These are quite different operations. One operates on RDDs in  DStream and
one operates on partitions of an RDD. They are not alternatives.

On Wed, Jul 8, 2015, 2:43 PM dgoldenberg  wrote:

> Is there a set of best practices for when to use foreachPartition vs.
> foreachRDD?
>
> Is it generally true that using foreachPartition avoids some of the
> over-network data shuffling overhead?
>
> When would I definitely want to use one method vs. the other?
>
> Thanks.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/foreachRDD-vs-forearchPartition-tp23714.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


RE: foreachRDD vs. forearchPartition ?

2015-07-08 Thread Evo Eftimov
foreachPartition results in having one instance of the lambda/closure per
partition when, e.g., publishing to output systems like message brokers,
databases and file systems - that increases the level of parallelism of your
output processing 

-Original Message-
From: dgoldenberg [mailto:dgoldenberg...@gmail.com] 
Sent: Wednesday, July 8, 2015 2:43 PM
To: user@spark.apache.org
Subject: foreachRDD vs. forearchPartition ?

Is there a set of best practices for when to use foreachPartition vs.
foreachRDD?

Is it generally true that using foreachPartition avoids some of the
over-network data shuffling overhead?

When would I definitely want to use one method vs. the other?

Thanks.



--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/foreachRDD-vs-forearchPartition-tp23714.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional
commands, e-mail: user-h...@spark.apache.org



-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Getting started with spark-scala developemnt in eclipse.

2015-07-08 Thread Ashish Dutt
Hello Prateek,
I started with getting the pre built binaries so as to skip the hassle of
building them from scratch.
I am not familiar with scala so can't comment on it.
I have documented my experiences on my blog, www.edumine.wordpress.com.
Perhaps it might be useful to you.
 On 08-Jul-2015 9:39 PM, "Prateek ."  wrote:

>  Hi
>
>
>
> I am beginner to scala and spark. I am trying to set up eclipse
> environment to develop spark program  in scala, then take it’s  jar  for
> spark-submit.
>
> How shall I start? To start my  task includes, setting up eclipse for
> scala and spark, getting dependencies resolved, building project using
> maven/sbt.
>
> Is there any good blog or documentation that is can follow.
>
>
>
> Thanks
>
>  "DISCLAIMER: This message is proprietary to Aricent and is intended
> solely for the use of the individual to whom it is addressed. It may
> contain privileged or confidential information and should not be circulated
> or used for any purpose other than for what it is intended. If you have
> received this message in error, please notify the originator immediately.
> If you are not the intended recipient, you are notified that you are
> strictly prohibited from using, copying, altering, or disclosing the
> contents of this message. Aricent accepts no responsibility for loss or
> damage arising from the use of the information transmitted by this email
> including damage from virus."
>


foreachRDD vs. forearchPartition ?

2015-07-08 Thread dgoldenberg
Is there a set of best practices for when to use foreachPartition vs.
foreachRDD?

Is it generally true that using foreachPartition avoids some of the
over-network data shuffling overhead?

When would I definitely want to use one method vs. the other?

Thanks.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/foreachRDD-vs-forearchPartition-tp23714.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Best practice for using singletons on workers (seems unanswered) ?

2015-07-08 Thread Dmitry Goldenberg
My singletons do in fact stick around. They're one per worker, looks like.
So with 4 workers running on the box, we're creating one singleton per
worker process/jvm, which seems OK.

Still curious about foreachPartition vs. foreachRDD though...

On Tue, Jul 7, 2015 at 11:27 AM, Richard Marscher 
wrote:

> Would it be possible to have a wrapper class that just represents a
> reference to a singleton holding the 3rd party object? It could proxy over
> calls to the singleton object which will instantiate a private instance of
> the 3rd party object lazily? I think something like this might work if the
> workers have the singleton object in their classpath.
>
> here's a rough sketch of what I was thinking:
>
> object ThirdPartySingleton {
>   private lazy val thirdPartyObj = ...
>
>   def someProxyFunction() = thirdPartyObj.()
> }
>
> class ThirdPartyReference extends Serializable {
>   def someProxyFunction() = ThirdPartySingleton.someProxyFunction()
> }
>
> also found this SO post:
> http://stackoverflow.com/questions/26369916/what-is-the-right-way-to-have-a-static-object-on-all-workers
>
>
> On Tue, Jul 7, 2015 at 11:04 AM, dgoldenberg 
> wrote:
>
>> Hi,
>>
>> I am seeing a lot of posts on singletons vs. broadcast variables, such as
>> *
>>
>> http://apache-spark-user-list.1001560.n3.nabble.com/Best-way-to-have-some-singleton-per-worker-tt20277.html
>> *
>>
>> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-share-a-NonSerializable-variable-among-tasks-in-the-same-worker-node-tt11048.html#a21219
>>
>> What's the best approach to instantiate an object once and have it be
>> reused
>> by the worker(s).
>>
>> E.g. I have an object that loads some static state such as e.g. a
>> dictionary/map, is a part of 3rd party API and is not serializable.  I
>> can't
>> seem to get it to be a singleton on the worker side as the JVM appears to
>> be
>> wiped on every request so I get a new instance.  So the singleton doesn't
>> stick.
>>
>> Is there an approach where I could have this object or a wrapper of it be
>> a
>> broadcast var? Can Kryo get me there? would that basically mean writing a
>> custom serializer?  However, the 3rd party object may have a bunch of
>> member
>> vars hanging off it, so serializing it properly may be non-trivial...
>>
>> Any pointers/hints greatly appreciated.
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Best-practice-for-using-singletons-on-workers-seems-unanswered-tp23692.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>


Getting started with spark-scala developemnt in eclipse.

2015-07-08 Thread Prateek .
Hi

I am a beginner to Scala and Spark. I am trying to set up an Eclipse environment to 
develop a Spark program in Scala, then take its jar for spark-submit.
How shall I start? My task includes setting up Eclipse for Scala and 
Spark, getting dependencies resolved, and building the project using Maven/sbt.
Is there any good blog or documentation that I can follow?

Thanks
"DISCLAIMER: This message is proprietary to Aricent and is intended solely for 
the use of the individual to whom it is addressed. It may contain privileged or 
confidential information and should not be circulated or used for any purpose 
other than for what it is intended. If you have received this message in error, 
please notify the originator immediately. If you are not the intended 
recipient, you are notified that you are strictly prohibited from using, 
copying, altering, or disclosing the contents of this message. Aricent accepts 
no responsibility for loss or damage arising from the use of the information 
transmitted by this email including damage from virus."


Re: Best practice for using singletons on workers (seems unanswered) ?

2015-07-08 Thread Dmitry Goldenberg
Richard,

That's exactly the strategy I've been trying, which is a wrapper singleton
class. But I was seeing the inner object being created multiple times.

I wonder if the problem has to do with the way I'm processing the RDD's.
I'm using JavaDStream to stream data (from Kafka). Then I'm processing the
RDD's like so

JavaPairInputDStream<String, String> messages =
KafkaUtils.createDirectStream(...)
JavaDStream<String> messageBodies = messages.map(...)
messageBodies.foreachRDD(new MyFunction());

where MyFunction implements Function<JavaRDD<String>, Void> {
  ...
  rdd.map / rdd.filter ...
  rdd.foreach(... perform final action ...)
}

Perhaps the multiple singletons I'm seeing are the per-executor instances?
Judging by the streaming programming guide, perhaps I should follow the
connection sharing example:

dstream.foreachRDD { rdd =>
  rdd.foreachPartition { partitionOfRecords =>
val connection = createNewConnection()
partitionOfRecords.foreach(record => connection.send(record))
connection.close()
  }
}

So I'd pre-create my singletons in the foreachPartition call which would
let them be the per-JVM singletons, to be passed into MyFunction which
would now be a partition processing function rather than an RDD processing
function.

I wonder whether these singletons would still be created on every call as
the master sends RDD data over to the workers ?

I also wonder whether using foreachPartition would be more efficient anyway
and prevent some of the over-network data shuffling effects that I imagine
may happen with just doing a foreachRDD ?
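
A minimal Scala sketch combining the two ideas - a lazily initialized per-JVM
holder used inside foreachPartition - with a hypothetical SolrWriter standing in
for the expensive, non-serializable client, and messageBodies assumed to be a
DStream[String] analogous to the one above:

// Hypothetical stand-in for the non-serializable 3rd-party client.
class SolrWriter {
  def write(doc: String): Unit = println(doc)
}

object SolrWriterHolder {
  // Initialized at most once per executor JVM, on first use, and then reused.
  lazy val writer: SolrWriter = new SolrWriter
}

messageBodies.foreachRDD { rdd =>
  rdd.foreachPartition { docs =>
    val writer = SolrWriterHolder.writer   // same instance for every partition in this JVM
    docs.foreach(writer.write)
  }
}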











On Tue, Jul 7, 2015 at 11:27 AM, Richard Marscher 
wrote:

> Would it be possible to have a wrapper class that just represents a
> reference to a singleton holding the 3rd party object? It could proxy over
> calls to the singleton object which will instantiate a private instance of
> the 3rd party object lazily? I think something like this might work if the
> workers have the singleton object in their classpath.
>
> here's a rough sketch of what I was thinking:
>
> object ThirdPartySingleton {
>   private lazy val thirdPartyObj = ...
>
>   def someProxyFunction() = thirdPartyObj.()
> }
>
> class ThirdPartyReference extends Serializable {
>   def someProxyFunction() = ThirdPartySingleton.someProxyFunction()
> }
>
> also found this SO post:
> http://stackoverflow.com/questions/26369916/what-is-the-right-way-to-have-a-static-object-on-all-workers
>
>
> On Tue, Jul 7, 2015 at 11:04 AM, dgoldenberg 
> wrote:
>
>> Hi,
>>
>> I am seeing a lot of posts on singletons vs. broadcast variables, such as
>> *
>>
>> http://apache-spark-user-list.1001560.n3.nabble.com/Best-way-to-have-some-singleton-per-worker-tt20277.html
>> *
>>
>> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-share-a-NonSerializable-variable-among-tasks-in-the-same-worker-node-tt11048.html#a21219
>>
>> What's the best approach to instantiate an object once and have it be
>> reused
>> by the worker(s).
>>
>> E.g. I have an object that loads some static state such as e.g. a
>> dictionary/map, is a part of 3rd party API and is not serializable.  I
>> can't
>> seem to get it to be a singleton on the worker side as the JVM appears to
>> be
>> wiped on every request so I get a new instance.  So the singleton doesn't
>> stick.
>>
>> Is there an approach where I could have this object or a wrapper of it be
>> a
>> broadcast var? Can Kryo get me there? would that basically mean writing a
>> custom serializer?  However, the 3rd party object may have a bunch of
>> member
>> vars hanging off it, so serializing it properly may be non-trivial...
>>
>> Any pointers/hints greatly appreciated.
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Best-practice-for-using-singletons-on-workers-seems-unanswered-tp23692.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>


RE: [SparkR] Float type coercion with hiveContext

2015-07-08 Thread Sun, Rui
Hi, Evgeny,

I reported a JIRA issue for your problem: 
https://issues.apache.org/jira/browse/SPARK-8897. You can track it to see how 
it will be solved.

Ray

-Original Message-
From: Evgeny Sinelnikov [mailto:esinelni...@griddynamics.com] 
Sent: Monday, July 6, 2015 7:27 PM
To: huangzheng
Cc: Apache Spark User List
Subject: Re: [SparkR] Float type coercion with hiveContext

I used spark 1.4.0 binaries from official site:
http://spark.apache.org/downloads.html

And running it on:
* Hortonworks HDP 2.2.0.0-2041
* with Hive 0.14
* with disabled hooks for Application Timeline Servers (ATSHook) in 
hive-site.xml (commented hive.exec.failure.hooks, hive.exec.post.hooks, 
hive.exec.pre.hooks)


On Mon, Jul 6, 2015 at 1:33 PM, huangzheng <1106944...@qq.com> wrote:
>
> Hi , Are you used sparkR about spark 1.4 version? How do build  from 
> spark source code ?
>
> -- Original message --
> From: "Evgeny Sinelnikov";;
> Sent: Monday, July 6, 2015, 6:31 PM
> To: "user";
> Subject: [SparkR] Float type coercion with hiveContext
>
> Hello,
>
> I've run into a problem with float type coercion on SparkR with hiveContext.
>
>> result <- sql(hiveContext, "SELECT offset, percentage from data limit
>> 100")
>
>> show(result)
> DataFrame[offset:float, percentage:float]
>
>> head(result)
> Error in as.data.frame.default(x[[i]], optional = TRUE) :
> cannot coerce class ""jobj"" to a data.frame
>
>
> This trouble looks like already exists (SPARK-2863 - Emulate Hive type 
> coercion in native reimplementations of Hive functions) with same 
> reason - not completed "native reimplementations of Hive..." not 
> "...functions" only.
>
> It looks like a bug.
> So, anybody met this issue before?
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For 
> additional commands, e-mail: user-h...@spark.apache.org

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional 
commands, e-mail: user-h...@spark.apache.org


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



RE: Out of Memory Errors on less number of cores in proportion to Partitions in Data

2015-07-08 Thread Evo Eftimov
Also try to increase the number of partitions gradually – not in one big jump 
from 20 to 100 but by adding e.g. 10 at a time – and see whether there is a 
correlation with how much RAM you have to add to the executors 

 

From: Evo Eftimov [mailto:evo.efti...@isecc.com] 
Sent: Wednesday, July 8, 2015 1:26 PM
To: 'Aniruddh Sharma'; 'user@spark.apache.org'
Subject: RE: Out of Memory Errors on less number of cores in proportion to 
Partitions in Data

 

Are you sure you have actually increased the RAM (how exactly did you do that, 
and does it show in the Spark UI)?

 

Also use the Spark UI and the driver console to check the RAM allocated for 
each RDD and each RDD partition in each of the scenarios.

 

Re b) the general rule is num of partitions = 2 x num of CPU cores

 

All partitions are operated on in parallel (by independently running JVM 
threads); however, if you have a substantially higher number of partitions (JVM 
threads) than cores, you get what happens in any JVM or OS – there will be 
switching between the threads, and some of them will sit suspended waiting for 
a free core (thread contexts also occupy additional RAM).
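As a rough illustration of that rule of thumb (a minimal sketch only; the
application name and input path below are made-up examples, not taken from this
thread), the target partition count can be derived from the cluster's
parallelism and applied with repartition:

    import org.apache.spark.{SparkConf, SparkContext}

    object PartitionSizing {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("partition-sizing"))

        // defaultParallelism is roughly (number of executors) x (cores per executor)
        val totalCores = sc.defaultParallelism
        val targetPartitions = 2 * totalCores   // the "2 x cores" rule of thumb

        val data = sc.textFile("hdfs:///data/ratings.csv")   // hypothetical input
          .repartition(targetPartitions)

        println(s"cores = $totalCores, partitions = ${data.partitions.length}")
        sc.stop()
      }
    }

Increasing the count far beyond this mainly adds thread and buffer overhead per
executor rather than extra parallelism, which matches the behaviour described
above.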

 

From: Aniruddh Sharma [mailto:asharma...@gmail.com] 
Sent: Wednesday, July 8, 2015 12:52 PM
To: Evo Eftimov
Subject: Re: Out of Memory Errors on less number of cores in proportion to 
Partitions in Data

 

Thanks for your reply...

I increased executor memory from 4 GB to 35 GB and the out-of-memory error 
still happens. So it seems it may not be entirely due to the extra buffers that 
come with more partitions.

Query a) Is there a way to debug at a more granular level, from the user-code 
perspective, to see where things could go wrong?

 

Query b) 

In general my query is: let's suppose it is not ALS (or some other iterative 
algorithm). Let's say it is some sample RDD with 1 partitions, where each 
executor holds 50 partitions and each machine has 4 physical cores. Do the 4 
physical cores try to process these 50 partitions in parallel (multitasking), 
or will the 4 cores first process the first 4 partitions, then the next 4 
partitions, and so on?

Thanks and Regards

Aniruddh

 

On Wed, Jul 8, 2015 at 5:09 PM, Evo Eftimov  wrote:

This is most likely due to the internal implementation of ALS in MLlib. 
Probably, for each parallel unit of execution (partition, in Spark terms), the 
implementation allocates and uses a RAM buffer where it keeps interim results 
during the ALS iterations.

 

If we assume that the size of that internal RAM buffer is fixed per unit of 
execution, then total RAM (20 partitions x fixed RAM buffer) < total RAM (100 
partitions x fixed RAM buffer).
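If the per-partition buffers are indeed the culprit, one knob worth trying (a
sketch only, assuming the MLlib 1.x RDD-based API; the input path and parameter
values are made-up examples) is to set the number of ALS blocks explicitly,
instead of letting it follow the partition count of the input RDD:

    import org.apache.spark.mllib.recommendation.{ALS, Rating}
    import org.apache.spark.{SparkConf, SparkContext}

    object AlsBlocks {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("als-blocks"))

        // hypothetical input: "user,product,rating" lines
        val ratings = sc.textFile("hdfs:///data/ratings.csv").map { line =>
          val Array(u, p, r) = line.split(',')
          Rating(u.toInt, p.toInt, r.toDouble)
        }

        // Fix the number of ALS blocks at a modest value so the per-block
        // state does not grow just because the input RDD was repartitioned.
        val model = ALS.train(ratings, /* rank */ 10, /* iterations */ 10,
                              /* lambda */ 0.01, /* blocks */ 20)

        println(s"trained model with rank ${model.rank}")
        sc.stop()
      }
    }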

 

From: Aniruddh Sharma [mailto:asharma...@gmail.com] 
Sent: Wednesday, July 8, 2015 12:22 PM
To: user@spark.apache.org
Subject: Out of Memory Errors on less number of cores in proportion to 
Partitions in Data

 

Hi,

I am new to Spark. I have done the following tests and I am confused by the 
conclusions. I have 2 queries.

Following are the details of the test:

Test 1) Used an 11-node cluster where each machine has 64 GB RAM and 4 physical 
cores. I ran the ALS algorithm from MLlib on a 1.6 GB data set. I ran 10 
executors and my Rating data set has 20 partitions. It works. In order to 
increase parallelism, I used 100 partitions instead of 20, and now the program 
does not work and throws an out-of-memory error.

 

Query a): I have 4 cores on each machine, but there are 10 partitions in each 
executor, so my cores are not sufficient for the partitions. Is it supposed to 
give memory errors under this kind of misconfiguration? If there are not enough 
cores and processing cannot all be done in parallel, can the different 
partitions not be processed sequentially, so that the operation becomes slow 
rather than throwing a memory error?

Query b) If it gives an error, the error message is not meaningful. Here my DAG 
was very simple and I could trace that lowering the number of partitions works, 
but if a misconfiguration of cores throws an error, how do I debug it in 
complex DAGs, given that the error does not say explicitly that the problem 
could be due to too few cores? If my understanding is incorrect, then kindly 
explain the reasons for the error in this case.

 

Thanks and Regards

Aniruddh

 



Announcement of the webinar in the newsletter and on the site

2015-07-08 Thread Oleh Rozvadovskyy
Hi there,

My name is Oleh Rozvadovskyy. I represent CyberVision Inc., an IoT company
and the developer of the Kaa IoT platform, which is open-source middleware for
smart devices and servers. In two weeks we're going to run a
webinar, *"IoT data ingestion in Spark Streaming using Kaa", on Thu, Jul 23,
2015, 2:00 PM - 3:00 PM EDT*, where we will build a solution that ingests
real-time data from various devices and sensors into Apache Spark for
stream processing. (
https://attendee.gotowebinar.com/register/315936237679516161
).

Is it possible to announce our webinar in your newsletter
and on your website? How can we do it?


Looking forward to your reply!

Best wishes,
Oleh Rozvadovskyy
CyberVision Inc.
skype: ole-gee


RE: Out of Memory Errors on less number of cores in proportion to Partitions in Data

2015-07-08 Thread Evo Eftimov
Are you sure you have actually increased the RAM (how exactly did you do that, 
and does it show in the Spark UI)?

 

Also use the Spark UI and the driver console to check the RAM allocated for 
each RDD and each RDD partition in each of the scenarios.

 

Re b) the general rule is num of partitions = 2 x num of CPU cores

 

All partitions are operated on in parallel (by independently running JVM 
threads); however, if you have a substantially higher number of partitions (JVM 
threads) than cores, you get what happens in any JVM or OS – there will be 
switching between the threads, and some of them will sit suspended waiting for 
a free core (thread contexts also occupy additional RAM).

 

From: Aniruddh Sharma [mailto:asharma...@gmail.com] 
Sent: Wednesday, July 8, 2015 12:52 PM
To: Evo Eftimov
Subject: Re: Out of Memory Errors on less number of cores in proportion to 
Partitions in Data

 

Thanks for your reply...

I increased executor memory from 4 GB to 35 GB and the out-of-memory error 
still happens. So it seems it may not be entirely due to the extra buffers that 
come with more partitions.

Query a) Is there a way to debug at a more granular level, from the user-code 
perspective, to see where things could go wrong?

 

Query b) 

In general my query is: let's suppose it is not ALS (or some other iterative 
algorithm). Let's say it is some sample RDD with 1 partitions, where each 
executor holds 50 partitions and each machine has 4 physical cores. Do the 4 
physical cores try to process these 50 partitions in parallel (multitasking), 
or will the 4 cores first process the first 4 partitions, then the next 4 
partitions, and so on?

Thanks and Regards

Aniruddh

 

On Wed, Jul 8, 2015 at 5:09 PM, Evo Eftimov  wrote:

This is most likely due to the internal implementation of ALS in MLlib. 
Probably, for each parallel unit of execution (partition, in Spark terms), the 
implementation allocates and uses a RAM buffer where it keeps interim results 
during the ALS iterations.

 

If we assume that the size of that internal RAM buffer is fixed per unit of 
execution, then total RAM (20 partitions x fixed RAM buffer) < total RAM (100 
partitions x fixed RAM buffer).

 

From: Aniruddh Sharma [mailto:asharma...@gmail.com] 
Sent: Wednesday, July 8, 2015 12:22 PM
To: user@spark.apache.org
Subject: Out of Memory Errors on less number of cores in proportion to 
Partitions in Data

 

Hi,

I am new to Spark. I have done the following tests and I am confused by the 
conclusions. I have 2 queries.

Following are the details of the test:

Test 1) Used an 11-node cluster where each machine has 64 GB RAM and 4 physical 
cores. I ran the ALS algorithm from MLlib on a 1.6 GB data set. I ran 10 
executors and my Rating data set has 20 partitions. It works. In order to 
increase parallelism, I used 100 partitions instead of 20, and now the program 
does not work and throws an out-of-memory error.

 

Query a): I have 4 cores on each machine, but there are 10 partitions in each 
executor, so my cores are not sufficient for the partitions. Is it supposed to 
give memory errors under this kind of misconfiguration? If there are not enough 
cores and processing cannot all be done in parallel, can the different 
partitions not be processed sequentially, so that the operation becomes slow 
rather than throwing a memory error?

Query b) If it gives an error, the error message is not meaningful. Here my DAG 
was very simple and I could trace that lowering the number of partitions works, 
but if a misconfiguration of cores throws an error, how do I debug it in 
complex DAGs, given that the error does not say explicitly that the problem 
could be due to too few cores? If my understanding is incorrect, then kindly 
explain the reasons for the error in this case.

 

Thanks and Regards

Aniruddh

 



Re: UDF in spark

2015-07-08 Thread vinod kumar
Thank you for the quick response, Vishnu.

I also have the following questions.

1. Is there any way to upload files to HDFS programmatically using the C#
language?
2. Is there any way to automatically load the Scala block of code (for the UDF)
when I start the Spark service?

-Vinod

On Wed, Jul 8, 2015 at 7:57 AM, VISHNU SUBRAMANIAN <
johnfedrickena...@gmail.com> wrote:

> HI Vinod,
>
> Yes, if you want to use a Scala or Python function you need the block of
> code.
>
> Only Hive UDFs are available permanently.
>
> Thanks,
> Vishnu
>
> On Wed, Jul 8, 2015 at 5:17 PM, vinod kumar 
> wrote:
>
>> Thanks Vishnu,
>>
>> When I restart the service the UDF is no longer accessible to my query; I
>> need to run the mentioned block again to use the UDF.
>> Is there any way to keep the UDF in sqlContext permanently?
>>
>> Thanks,
>> Vinod
>>
>> On Wed, Jul 8, 2015 at 7:16 AM, VISHNU SUBRAMANIAN <
>> johnfedrickena...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> sqlContext.udf.register("udfname", functionname _)
>>>
>>> example:
>>>
>>> def square(x:Int):Int = { x * x}
>>>
>>> register udf as below
>>>
>>> sqlContext.udf.register("square",square _)
>>>
>>> Thanks,
>>> Vishnu
>>>
>>> On Wed, Jul 8, 2015 at 2:23 PM, vinod kumar 
>>> wrote:
>>>
 Hi Everyone,

 I am new to Spark. May I know how to define and use a User Defined Function
 in Spark SQL?

 I want to use the defined UDF in SQL queries.

 My Environment
 Windows 8
 Spark 1.3.1

 Warm Regards,
 Vinod



>>>
>>
>
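For the "load the UDF automatically at startup" part of this thread, one option
(a sketch only; the file name and UDF are made-up examples, and whether it fits
depends on how the "Spark service" is actually started) is to keep the
registration calls in an init script and pass it to the shell with -i, so the
session-scoped UDFs are re-registered every time the shell comes up:

    // udf-init.scala -- hypothetical init script, run with:
    //   spark-shell -i udf-init.scala
    // spark-shell already provides `sc` and `sqlContext` in scope.

    def square(x: Int): Int = x * x

    // Re-register the session-scoped UDF on every shell start; this does not
    // persist anything, it just automates the registration step.
    sqlContext.udf.register("square", square _)

    println("registered UDF: square")

For UDFs that must survive independently of any init script, the route noted in
the thread applies: implement them as Hive UDFs and register them in Hive
itself, since only Hive UDFs are kept permanently.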

