RE: Getting prediction values in spark mllib

2016-02-11 Thread Chandan Verma
Thanks it got solved ☺

From: Artem Aliev [mailto:artem.al...@gmail.com]
Sent: Thursday, February 11, 2016 3:19 PM
To: Sonal Goyal
Cc: Chandan Verma; user@spark.apache.org
Subject: Re: Getting prediction values in spark mllib

It depends on the algorithm you use.
NaiveBayesModel has a predictProbabilities method to work directly with 
probabilities.
For LogisticRegressionModel and SVMModel, calling clearThreshold() will make the 
predict method return probabilities, as mentioned above.
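
A minimal Scala sketch of both options (the thread's code is Java, but the calls
are the same; `training` and `data` below are assumed RDD[LabeledPoint]s):

import org.apache.spark.mllib.classification.{LogisticRegressionWithLBFGS, NaiveBayes}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

def rawScores(training: RDD[LabeledPoint], data: RDD[LabeledPoint]): Unit = {
  // Logistic regression / SVM: clearing the threshold makes predict() return
  // the raw score (a probability for logistic regression) instead of 0.0 / 1.0
  val lrModel = new LogisticRegressionWithLBFGS().run(training)
  lrModel.clearThreshold()
  val scores = data.map(p => (lrModel.predict(p.features), p.label))

  // Naive Bayes: predictProbabilities gives the per-class probability vector
  val nbModel = NaiveBayes.train(training)
  val probs = data.map(p => (nbModel.predictProbabilities(p.features), p.label))
}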

On Thu, Feb 11, 2016 at 11:17 AM, Sonal Goyal wrote:

Looks like you are doing binary classification and you are getting the label 
out. If you clear the model threshold, you should be able to get the raw score.
On Feb 11, 2016 1:32 PM, "Chandan Verma" wrote:

Following is the code Snippet


JavaRDD<Tuple2<Double, Double>> predictionAndLabels = data
    .map(new Function<LabeledPoint, Tuple2<Double, Double>>() {
        public Tuple2<Double, Double> call(LabeledPoint p) {
            Double prediction = sameModel.predict(p.features());
            return new Tuple2<Double, Double>(prediction, p.label());
        }
    });

The line "sameModel.predict(p.features());" gives me the prediction as double 
value (eg 0.0 or 1.0) .
How can i get the prediction value with more digits after decimal 
point.eg 0.2344 etc




Re: Getting prediction values in spark mllib

2016-02-11 Thread Artem Aliev
It depends on the algorithm you use.
NaiveBayesModel has a predictProbabilities method to work directly with
probabilities.
For LogisticRegressionModel and SVMModel, calling clearThreshold() will make the
predict method return probabilities, as mentioned above.

On Thu, Feb 11, 2016 at 11:17 AM, Sonal Goyal  wrote:

> Looks like you are doing binary classification and you are getting the
> label out. If you clear the model threshold, you should be able to get the
> raw score.
> On Feb 11, 2016 1:32 PM, "Chandan Verma" 
> wrote:
>
>>
>>
>> Following is the code Snippet
>>
>>
>>
>>
>>
>> JavaRDD> predictionAndLabels = data
>>
>> .map(new
>> Function>() {
>>
>>
>> public Tuple2 call(LabeledPoint p) {
>>
>>
>> Double prediction = sameModel.predict(p.features());
>>
>>
>> return new Tuple2(prediction, p.label());
>>
>>
>> }
>>
>> });
>>
>>
>>
>> The line "sameModel.predict(p.features());" gives me the prediction as
>> double value (eg 0.0 or 1.0) .
>>
>> How can i get the prediction value with more digits after decimal
>> point.eg 0.2344 etc
>>
>>
>


Re: Getting prediction values in spark mllib

2016-02-11 Thread Sonal Goyal
Looks like you are doing binary classification and you are getting the
label out. If you clear the model threshold, you should be able to get the
raw score.
On Feb 11, 2016 1:32 PM, "Chandan Verma" 
wrote:

>
>
> Following is the code Snippet
>
>
>
>
>
> JavaRDD> predictionAndLabels = data
>
> .map(new
> Function>() {
>
>
> public Tuple2 call(LabeledPoint p) {
>
>
> Double prediction = sameModel.predict(p.features());
>
>
> return new Tuple2(prediction, p.label());
>
>
> }
>
> });
>
>
>
> The line "sameModel.predict(p.features());" gives me the prediction as
> double value (eg 0.0 or 1.0) .
>
> How can i get the prediction value with more digits after decimal point.eg
> 0.2344 etc
>
>


[OT] Apache Spark Jobs in Kochi, India

2016-02-11 Thread Andrew Holway
Hello,

I'm not sure how appropriate job postings are to a user group.

We're getting deep into spark and are looking for some talent in our Kochi
office.

http://bit.ly/Spark-Eng - Apache Spark Engineer / Architect - Kochi
http://bit.ly/Spark-Dev - Lead Apache Spark Developer - Kochi

Sorry for the noise!

Cheers,

Andrew


Getting prediction values in spark mllib

2016-02-11 Thread Chandan Verma
 

Following is the code Snippet

 

 

JavaRDD<Tuple2<Double, Double>> predictionAndLabels = data
    .map(new Function<LabeledPoint, Tuple2<Double, Double>>() {
        public Tuple2<Double, Double> call(LabeledPoint p) {
            Double prediction = sameModel.predict(p.features());
            return new Tuple2<Double, Double>(prediction, p.label());
        }
    });



The line "sameModel.predict(p.features());" gives me the prediction as
double value (eg 0.0 or 1.0) .

How can i get the prediction value with more digits after decimal point.eg
0.2344 etc







spark shell ini file

2016-02-11 Thread Mich Talebzadeh
 

Hi, 

In Hive one can use the -i parameter to preload certain settings into
beeline etc.

Is there an equivalent parameter for spark-shell as well?

For example:

spark-shell --master spark://50.140.197.217:7077

Can I pass a parameter file?

Thanks 

-- 

Mich Talebzadeh

LinkedIn
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com


 

Re: retrieving all the rows with collect()

2016-02-11 Thread Mich Talebzadeh
 

Thanks Jacob much appreciated 

Mich 

On 11/02/2016 00:01, Jakob Odersky wrote: 

> Exactly!
> As a final note, `foreach` is also defined on RDDs. This means that
> you don't need to `collect()` the results into an array (which could
> give you an OutOfMemoryError in case the RDD is really really large)
> before printing them.
> 
> Personally, when I learn using a new library, I like to look at its
> Scaladoc 
> (http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD
>  [1]
> for Spark) and test it in the REPL/worksheets (for Spark you already
> have `spark-shell`)
> 
> best,
> --Jakob

 

Links:
--
[1]
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD

Re: SparkSQL parallelism

2016-02-11 Thread Rishi Mishra
I am not sure why all 3 nodes would run the query. If you have not specified
any partitioning options, there should be only one partition of the JDBCRDD,
where the whole dataset resides.
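
A minimal sketch of the partitioning options that make the JDBC source split the
read across workers (the partition column and bounds below are placeholders;
without them the whole result set lands in a single partition):

val employees = sqlContext.read.format("jdbc").options(Map(
  "url"             -> "jdbc:oracle:thin:@host:1525:SID",
  "dbtable"         -> "(select * from employee where name like '%18%') emp",
  "user"            -> "username",
  "password"        -> "password",
  "partitionColumn" -> "EMP_ID",   // placeholder: any numeric column to range-partition on
  "lowerBound"      -> "1",
  "upperBound"      -> "100000",
  "numPartitions"   -> "4")).load()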


On Fri, Feb 12, 2016 at 10:15 AM, Madabhattula Rajesh Kumar <
mrajaf...@gmail.com> wrote:

> Hi,
>
> I have a spark cluster with One Master and 3 worker nodes. I have written
> a below code to fetch the records from oracle using sparkSQL
>
> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
> val employees = sqlContext.read.format("jdbc").options(
> Map("url" -> "jdbc:oracle:thin:@:1525:SID",
> "dbtable" -> "(select * from employee where name like '%18%')",
> "user" -> "username",
> "password" -> "password")).load
>
> I have a submitted this job to spark cluster using spark-submit command.
>
>
>
> *Looks like, All 3 workers are executing same query and fetching same
> data. It means, it is making 3 jdbc calls to oracle.*
> *How to make this code to make a single jdbc call to oracle(In case of
> more than one worker) ?*
>
> Please help me to resolve this use case
>
> Regards,
> Rajesh
>
>
>


-- 
Regards,
Rishitesh Mishra,
SnappyData . (http://www.snappydata.io/)

https://in.linkedin.com/in/rishiteshmishra


Re: Spark Streaming with Kafka: Dealing with 'slow' partitions

2016-02-11 Thread Sebastian Piu
Have you tried using fair scheduler and queues
On 12 Feb 2016 4:24 a.m., "p pathiyil"  wrote:

> With this setting, I can see that the next job is being executed before
> the previous one is finished. However, the processing of the 'hot'
> partition eventually hogs all the concurrent jobs. If there was a way to
> restrict jobs to be one per partition, then this setting would provide the
> per-partition isolation.
>
> Is there anything in the framework which would give control over that
> aspect ?
>
> Thanks.
>
>
> On Thu, Feb 11, 2016 at 9:55 PM, Cody Koeninger 
> wrote:
>
>> spark.streaming.concurrentJobs
>>
>>
>> see e.g. 
>> http://stackoverflow.com/questions/23528006/how-jobs-are-assigned-to-executors-in-spark-streaming
>>
>>
>> On Thu, Feb 11, 2016 at 9:33 AM, p pathiyil  wrote:
>>
>>> Thanks for the response Cody.
>>>
>>> The producers are out of my control, so can't really balance the
>>> incoming content across the various topics and partitions. The number of
>>> topics and partitions are quite large and the volume across then not very
>>> well known ahead of time. So it is quite hard to segregate low and high
>>> volume topics in to separate driver programs.
>>>
>>> Will look at shuffle / repartition.
>>>
>>> Could you share the setting for starting another batch in parallel ? It
>>> might be ok to call the 'save' of the processed messages out of order if
>>> that is the only consequence of this setting.
>>>
>>> When separate DStreams are created per partition (and if union() is not
>>> called on them), what aspect of the framework still ties the scheduling of
>>> jobs across the partitions together ? Asking this to see if creating
>>> multiple threads in the driver and calling createDirectStream per partition
>>> in those threads can provide isolation.
>>>
>>>
>>>
>>> On Thu, Feb 11, 2016 at 8:14 PM, Cody Koeninger 
>>> wrote:
>>>
 The real way to fix this is by changing partitioning, so you don't have
 a hot partition.  It would be better to do this at the time you're
 producing messages, but you can also do it with a shuffle / repartition
 during consuming.

 There is a setting to allow another batch to start in parallel, but
 that's likely to have unintended consequences.

 On Thu, Feb 11, 2016 at 7:59 AM, p pathiyil  wrote:

> Hi,
>
> I am looking at a way to isolate the processing of messages from each
> Kafka partition within the same driver.
>
> Scenario: A DStream is created with the createDirectStream call by
> passing in a few partitions. Let us say that the streaming context is
> defined to have a time duration of 2 seconds. If the processing of 
> messages
> from a single partition takes more than 2 seconds (while all the others
> finish much quicker), it seems that the next set of jobs get scheduled 
> only
> after the processing of that last partition. This means that the delay is
> effective for all partitions and not just the partition that was truly the
> cause of the delay. What I would like to do is to have the delay only
> impact the 'slow' partition.
>
> Tried to create one DStream per partition and then do a union of all
> partitions, (similar to the sample in
> http://spark.apache.org/docs/latest/streaming-programming-guide.html#reducing-the-batch-processing-times),
> but that didn't seem to help.
>
> Please suggest the correct approach to solve this issue.
>
> Thanks,
> Praveen.
>


>>>
>>
>


mllib:Survival Analysis : assertion failed: AFTAggregator loss sum is infinity. Error for unknown reason.

2016-02-11 Thread Stuti Awasthi
Hi All,
I wanted to try survival analysis on Spark 1.6. I am successfully able to run 
the AFT example provided. Now I tried to train the model with the Ovarian data, 
which is a standard data set that comes with the survival library in R.
Default Column Name :  Futime,fustat,age,resid_ds,rx,ecog_ps

Here are the steps I have done :

* Loaded the data from csv to dataframe labeled as
val ovarian_data = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true") // Use first line of all files as header
  .option("inferSchema", "true") // Automatically infer data types
  .load("Ovarian.csv").toDF("label", "censor", "age", "resid_ds", "rx", 
"ecog_ps")

* Utilize the VectorAssembler() to create features from "age", 
"resid_ds", "rx", "ecog_ps" like
val assembler = new VectorAssembler()
.setInputCols(Array("age", "resid_ds", "rx", "ecog_ps"))
.setOutputCol("features")


* Then I create a new dataframe with only 3 columns as:
val training = finalDf.select("label", "censor", "features")

* Finally I'm passing it to AFT:
val model = aft.fit(training)

I'm getting the error:
java.lang.AssertionError: assertion failed: AFTAggregator loss sum is infinity. 
Error for unknown reason.
   at scala.Predef$.assert(Predef.scala:179)
   at 
org.apache.spark.ml.regression.AFTAggregator.add(AFTSurvivalRegression.scala:480)
   at 
org.apache.spark.ml.regression.AFTCostFun$$anonfun$5.apply(AFTSurvivalRegression.scala:522)
   at 
org.apache.spark.ml.regression.AFTCostFun$$anonfun$5.apply(AFTSurvivalRegression.scala:521)
   at 
scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144)
   at 
scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144)
   at scala.collection.Iterator$class.foreach(Iterator.scala:727)

I have tried to print the schema :
()root
|-- label: double (nullable = true)
|-- censor: double (nullable = true)
|-- features: vector (nullable = true)

Sample data training looks like
[59.0,1.0,[72.3315,2.0,1.0,1.0]]
[115.0,1.0,[74.4932,2.0,1.0,1.0]]
[156.0,1.0,[66.4658,2.0,1.0,2.0]]
[421.0,0.0,[53.3644,2.0,2.0,1.0]]
[431.0,1.0,[50.3397,2.0,1.0,1.0]]

I'm not able to understand the error: if I use the same data and create the 
denseVector as given in the sample AFT example, the code works completely fine. 
But I would like to read the data from the CSV file and then proceed.

Please suggest

Thanks 
Stuti Awasthi







Re: off-heap certain operations

2016-02-11 Thread Sea
spark.memory.offHeap.enabled defaults to false; the description in the Spark docs 
is wrong. Spark 1.6 does not recommend using off-heap memory.




------ Original message ------
From: "Ovidiu-Cristian MARCU"
Date: Friday, Feb 12, 2016, 5:51
To: "user"

Subject: off-heap certain operations



Hi,

Reading through the latest documentation for memory management I can see that 
the parameter spark.memory.offHeap.enabled (true by default) is described with 
"If true, Spark will attempt to use off-heap memory for certain operations" 
[1].


Can you please describe the certain operations you are referring to?  


http://spark.apache.org/docs/latest/configuration.html#memory-management


Thank!


Best,
Ovidiu

Re: Passing a dataframe to where clause + Spark SQL

2016-02-11 Thread Rishabh Wadhawan
Hi Divya
Considering you are able to successfully load both tables testCond and test as 
data frames.
Now, taking your case:
When you do val condval = testCond.select("Cond") // where Cond is a column 
name, condval is a DataFrame. Even if it has only one row, it is still a data 
frame.
If you want to filter data in the data frame (test) according to the values in 
the data frame condval, which is what I understand you are trying to do,

the appropriate syntax would be this in Java (I am not sure about the Scala 
syntax, but I am sure it is easy to port):
DataFrame newDataFrame = test.join(condval,
    test.col("Cond").equalTo(condval.col("Cond")), "inner");

This would give you the rows in the test data frame that match the Cond column 
in condval.

Please tell me if that helps.

> On Feb 10, 2016, at 9:23 PM, Divya Gehlot  wrote:
> 
> Hi,
> //Loading all the DB Properties
> val options1 = Map("url" -> 
> "jdbc:oracle:thin:@xx.xxx.xxx.xx:1521:dbname","user"->"username","password"->"password","dbtable"
>  -> "TESTCONDITIONS")
> val testCond  = sqlContext.load("jdbc",options1 )
> val condval = testCond.select("Cond")
> 
> testCond.show()
> val options2 = Map("url" -> 
> "jdbc:oracle:thin:@xx.xxx.xxx.xx:1521:dbanme","user"->"username","password"->"password","dbtable"
>  -> "Test")
> val test= sqlContext.load("jdbc",options2 )
> test.select.where(condval ) //gives error as cannot convert sql.Column to 
> Dataframe
> 
> test.select().where(???)
> 
> My TestConditions table has only one row
> which looks like year = 1965 and month = ;december'
> 
> Can I convert sql.Column to list and pass ?
> I am new Spark and scala.
> 
> 
> Will really appreciate the help.
> 
> Thanks,
> Divya 


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Spark workers disconnecting on 1.5.2

2016-02-11 Thread Andy Max
I launched a 4 node Spark 1.5.2 cluster. No activity for a day or so. Now
noticed that few of the workers are disconnected.
Don't see this issue on Spark 1.4 or Spark 1.3.

Would appreciate any pointers.

Thx


Re: Scala types to StructType

2016-02-11 Thread Rishabh Wadhawan
I had the same issue. I resolved it in Java, but I am pretty sure it would work 
with Scala too. It is kind of a gross hack, but what I did (say I had a table in 
MySQL with 1000 columns) is that I ran a JDBC query to extract the schema of the 
table. I stored that schema and wrote a map function to create StructFields 
using StructType and RowFactory. Then I took that table loaded as a DataFrame, 
even though it had a schema, and converted that data frame into an RDD; this is 
when it lost the schema. I then performed something using that RDD and converted 
the RDD back with the StructFields.
If your source is a structured type, it would be better if you can load it 
directly as a DataFrame; that way you can preserve the schema. However, in your 
case you should do something like this:

List<StructField> fields = new ArrayList<>();
for (String key : map.keySet()) {
    fields.add(DataTypes.createStructField(key, DataTypes.StringType, true));
}
StructType schemaOfDataFrame = DataTypes.createStructType(fields);
sqlContext.createDataFrame(rdd, schemaOfDataFrame);

This is how I would do it in Java; I am not sure about the Scala syntax. 
Please tell me if that helped.
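
A rough Scala equivalent of the same idea, building the StructType from one
sample Map by inspecting the values' runtime types (this assumes every row has
the same keys, and the type mapping is only a starting point):

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

def schemaFor(sample: Map[String, Any]): StructType = StructType(
  sample.toSeq.sortBy(_._1).map { case (name, value) =>
    val dt = value match {
      case _: java.lang.Integer => IntegerType
      case _: java.lang.Long    => LongType
      case _: java.lang.Double  => DoubleType
      case _: String            => StringType
      case _                    => StringType   // extend the mapping as needed
    }
    StructField(name, dt, nullable = true)
  })

// Rows must list values in the same (sorted) key order as the schema:
// val schema = schemaFor(rdd.first())
// val rowRDD = rdd.map(m => Row.fromSeq(m.toSeq.sortBy(_._1).map(_._2)))
// val df    = sqlContext.createDataFrame(rowRDD, schema)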
> On Feb 11, 2016, at 7:20 AM, Fabian Böhnlein  
> wrote:
> 
> Hi all,
> 
> is there a way to create a Spark SQL Row schema based on Scala data types 
> without creating a manual mapping? 
> 
> That's the only example I can find which doesn't require 
> spark.sql.types.DataType already as input, but it requires to define them as 
> Strings.
> 
> * val struct = (new StructType)
> *   .add("a", "int")
> *   .add("b", "long")
> *   .add("c", "string")
> 
> 
> Specifically I have an RDD where each element is a Map of 100s of variables 
> with different data types which I want to transform to a DataFrame
> where the keys should end up as the column names:
> Map ("Amean" -> 20.3, "Asize" -> 12, "Bmean" -> )
> 
> Is there a different possibility than building a mapping from the values' 
> .getClass to the Spark SQL DataTypes?
> 
> 
> Thanks,
> Fabian
> 
> 
> 



Re: Kafka + Spark 1.3 Integration

2016-02-11 Thread Cody Koeninger
That's what the kafkaParams argument is for.  Not all of the kafka
configuration parameters will be relevant, though.
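
For reference, a minimal Scala sketch of that overload (the Java API mirrors it
with a java.util.Map; the hostnames, group id and topic name are placeholders):

import kafka.serializer.StringDecoder
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.kafka.KafkaUtils

val kafkaParams = Map(
  "zookeeper.connect"            -> "zkhost:2181",
  "group.id"                     -> "my-consumer-group",
  "zookeeper.session.timeout.ms" -> "500",
  "zookeeper.sync.time.ms"       -> "250",
  "auto.commit.interval.ms"      -> "1000")

// ssc is an existing StreamingContext
val stream = KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Map("mytopic" -> 1), StorageLevel.MEMORY_AND_DISK_SER)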

On Thu, Feb 11, 2016 at 12:07 PM, Nipun Arora 
wrote:

> Hi ,
>
> Thanks for the explanation and the example link. Got it working.
> A follow up question. In Kafka one can define properties as follows:
>
> Properties props = new Properties();
> props.put("zookeeper.connect", zookeeper);
> props.put("group.id", groupId);
> props.put("zookeeper.session.timeout.ms", "500");
> props.put("zookeeper.sync.time.ms", "250");
> props.put("auto.commit.interval.ms", "1000");
>
>
> How can I do the same for the receiver inside spark-streaming for Spark V1.3.1
>
>
> Thanks
>
> Nipun
>
>
>
> On Wed, Feb 10, 2016 at 3:59 PM Cody Koeninger  wrote:
>
>> It's a pair because there's a key and value for each message.
>>
>> If you just want a single topic, put a single topic in the map of topic
>> -> number of partitions.
>>
>> See
>>
>>
>> https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/streaming/JavaKafkaWordCount.java
>>
>>
>> On Wed, Feb 10, 2016 at 1:28 PM, Nipun Arora 
>> wrote:
>>
>>> Hi,
>>>
>>> I am trying some basic integration and was going through the manual.
>>>
>>> I would like to read from a topic, and get a JavaReceiverInputDStream
>>>  for messages in that topic. However the example is of
>>> JavaPairReceiverInputDStream<>. How do I get a stream for only a single
>>> topic in Java?
>>>
>>> Reference Page:
>>> https://spark.apache.org/docs/1.3.0/streaming-kafka-integration.html
>>>
>>>  import org.apache.spark.streaming.kafka.*;
>>>
>>>  JavaPairReceiverInputDStream kafkaStream =
>>>  KafkaUtils.createStream(streamingContext,
>>>  [ZK quorum], [consumer group id], [per-topic number of Kafka 
>>> partitions to consume]);
>>>
>>>
>>> Also in the example above what does  signify?
>>>
>>> Thanks
>>> Nipun
>>>
>>
>>


best practices? spark streaming writing output detecting disk full error

2016-02-11 Thread Andy Davidson
We recently started a Spark/Spark Streaming POC. We wrote a simple streaming
app in java to collect tweets. We choose twitter because we new we get a lot
of data and probably lots of burst. Good for stress testing

We spun up  a couple of small clusters using the spark-ec2 script. In one
cluster we wrote all the tweets to HDFS in a second cluster we write all the
tweets to S3

We were surprised that our HDFS file system reached 100% of capacity in a
few days. This resulted in "all data nodes dead". We were surprised
because the streaming app itself continued to run. We had no idea we had a
problem until a day or two after the disk became full, when we noticed we
were missing a lot of data.

We ran into a similar problem with our S3 cluster. We had a permission
problem and were unable to write any data, yet our streaming app continued to
run.


Spark generated mountains of logs. We are using the standalone cluster
manager. All the log levels wind up in the "error" log, making it hard to
find real errors and warnings using the web UI. Our app is written in Java,
so my guess is the write errors must be unchecked exceptions, i.e. we did not
know in advance that they could occur. They are basically undocumented.



We are a small shop. Running something like splunk would add a lot of
expense and complexity for us at this stage of our growth.

What are best practices

Kind Regards

Andy




Re: How to parallel read files in a directory

2016-02-11 Thread Jakob Odersky
Hi Junjie,

How do you access the files currently? Have you considered using hdfs? It's
designed to be distributed across a cluster and Spark has built-in support.

Best,
--Jakob
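
A tiny sketch of what that looks like once the files sit in HDFS (the path is a
placeholder); a glob or a whole directory is read as one RDD whose partitions
are processed by all the workers in parallel:

val data = sc.textFile("hdfs:///data/mydataset/*.txt")   // or just the directory
println(data.partitions.length)   // roughly one partition per file split / HDFS block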
On Feb 11, 2016 9:33 AM, "Junjie Qian"  wrote:

> Hi all,
>
> I am working with Spark 1.6, scala and have a big dataset divided into
> several small files.
>
> My question is: right now the read operation takes really long time and
> often has RDD warnings. Is there a way I can read the files in parallel,
> that all nodes or workers read the file at the same time?
>
> Many thanks
> Junjie
>


Inserting column to DataFrame

2016-02-11 Thread Zsolt Tóth
Hi,

I'd like to append a column of a dataframe to another DF (using Spark
1.5.2):

DataFrame outputDF = unlabelledDF.withColumn("predicted_label",
predictedDF.col("predicted"));

I get the following exception:

java.lang.IllegalArgumentException: requirement failed: DataFrame must have
the same schema as the relation to which is inserted.
DataFrame schema: StructType(StructField(predicted_label,DoubleType,true),
...
Relation schema: StructType(StructField(predicted_label,DoubleType,true),
...

The interesting part is that the two schemas in the exception are exactly
the same.
The same code with other input data (with fewer, both numerical and
non-numerical column) succeeds.
Any idea why this happens?


Re: spark shell ini file

2016-02-11 Thread Ted Yu
Please see:

[SPARK-13086][SHELL] Use the Scala REPL settings, to enable things like `-i
file`
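
For example, with a hypothetical init file (init.scala) whose contents are just
ordinary shell input:

// init.scala, preloaded with something like:
//   spark-shell --master spark://50.140.197.217:7077 -i init.scala
// (how -i is handled depends on the REPL wiring referenced in SPARK-13086)
sc.setLogLevel("WARN")
val oralog = sc.textFile("/test/alert_mydb.log")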

On Thu, Feb 11, 2016 at 1:45 AM, Mich Talebzadeh <
mich.talebza...@cloudtechnologypartners.co.uk> wrote:

> Hi,
>
>
>
> in Hive one can use -I parameter to preload certain setting into the
> beeline etc.
>
>
>
> Is there equivalent parameter for spark-shell as well.
>
>
>
> for example
>
> spark-shell --master spark://50.140.197.217:7077
>
> can I pass a parameter file?
>
>
>
> Thanks
>
>
> --
>
> Mich Talebzadeh
>
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> http://talebzadehmich.wordpress.com
>
>
>
>


Re: Spark Certification

2016-02-11 Thread Timothy Spann
I was wondering that as well.

Also is it fully updated for 1.6?

Tim
http://airisdata.com/
http://sparkdeveloper.com/


From: naga sharathrayapati 
>
Date: Wednesday, February 10, 2016 at 11:36 PM
To: "user@spark.apache.org" 
>
Subject: Spark Certification

Hello All,

I am planning on taking the Spark Certification and I was wondering if one has to 
be well equipped with MLlib & GraphX as well or not?

Please advise

Thanks


Spark Streaming with Kafka: Dealing with 'slow' partitions

2016-02-11 Thread p pathiyil
Hi,

I am looking at a way to isolate the processing of messages from each Kafka
partition within the same driver.

Scenario: A DStream is created with the createDirectStream call by passing
in a few partitions. Let us say that the streaming context is defined to
have a time duration of 2 seconds. If the processing of messages from a
single partition takes more than 2 seconds (while all the others finish
much quicker), it seems that the next set of jobs get scheduled only after
the processing of that last partition. This means that the delay is
effective for all partitions and not just the partition that was truly the
cause of the delay. What I would like to do is to have the delay only
impact the 'slow' partition.

Tried to create one DStream per partition and then do a union of all
partitions, (similar to the sample in
http://spark.apache.org/docs/latest/streaming-programming-guide.html#reducing-the-batch-processing-times),
but that didn't seem to help.

Please suggest the correct approach to solve this issue.

Thanks,
Praveen.


Dataframes

2016-02-11 Thread Gaurav Agarwal
Hi

Can we load 5 data frames for 5 tables in one Spark context?
I am asking because we have to give:

Map options = new HashMap();
options.put(driver, "");
options.put(url, "");
options.put(dbtable, "");

I can give only one table query at a time in the dbtable option.
How will I register multiple queries and dataframes with all the tables?

Thanks


RE: Dataframes

2016-02-11 Thread Prashant Verma
Hi Gaurav,
You can try something like this.

SparkConf conf = new SparkConf();
JavaSparkContext sc = new JavaSparkContext(conf);
SQLContext sqlContext = new org.apache.spark.sql.SQLContext(sc);
Class.forName("com.mysql.jdbc.Driver");
String url="url";
Properties prop = new java.util.Properties();
prop.setProperty("user","user");
prop.setProperty("password","password");
DataFrame tableA = sqlContext.read().jdbc(url,"tableA",prop);
DataFrame tableB = sqlContext.read().jdbc(url,"tableB",prop);

Hope this helps.

Thanks,
Prashant



From: Gaurav Agarwal [mailto:gaurav130...@gmail.com]
Sent: Thursday, February 11, 2016 7:35 PM
To: user@spark.apache.org
Subject: Dataframes


Hi

Can we load 5 data frame for 5 tables in one spark context.
I am asking why because we have to give
Map options= new hashmap();

Options.put(driver,"");
Options.put(URL,"");
Options.put(dbtable,"");

I can give only table query at time in dbtable options .
How will I register multiple queries and dataframes

Thankw
with all table.

Thanks
+


Scala types to StructType

2016-02-11 Thread Fabian Böhnlein

Hi all,

is there a way to create a Spark SQL Row schema based on Scala data 
types without creating a manual mapping?


That's the only example I can find which doesn't require 
spark.sql.types.DataType already as input, but it requires to define 
them as Strings.


val struct = (new StructType)
  .add("a", "int")
  .add("b", "long")
  .add("c", "string")




Specifically I have an RDD where each element is a Map of 100s of 
variables with different data types which I want to transform to a DataFrame

where the keys should end up as the column names:

Map ("Amean" -> 20.3, "Asize" -> 12, "Bmean" -> )


Is there a different possibility than building a mapping from the 
values' .getClass to the Spark SQL DataTypes?



Thanks,
Fabian




Question on Spark architecture and DAG

2016-02-11 Thread Mich Talebzadeh
Hi,

I have used Hive on the Spark engine (and of course Hive tables), and it is pretty
impressive compared to Hive on the MR engine.

 

Let us assume that I use spark shell. Spark shell is a client that connects
to spark master running on a host and port like below

spark-shell --master spark://50.140.197.217:7077:

Ok once I connect I create an RDD to read a text file:

val oralog = sc.textFile("/test/alert_mydb.log")

I then search for word Errors in that file

oralog.filter(line => line.contains("Errors")).collect().foreach(line =>
println(line))

 

Questions:

 

1.  In order to display the lines (the result set) containing word
"Errors", the content of the file (i.e. the blocks on HDFS) need to be read
into memory. Is my understanding correct that as per RDD notes those blocks
from the file will be partitioned across the cluster and each node will have
its share of blocks in memory?
2.  Once the result is returned back they need to be sent to the client
that has made the connection to master. I guess this is a simple TCP
operation much like any relational database sending the result back?
3.  Once the results are returned if no request has been made to keep
the data in memory, those blocks in memory will be discarded?
4.  Regardless of the storage block size on disk (128MB, 256MB etc), the
memory pages are 2K in relational databases? Is this the case in Spark as
well?

Thanks,

 Mich Talebzadeh

 

LinkedIn

https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

 

  http://talebzadehmich.wordpress.com

 


 

 



Re: Skip empty batches - spark streaming

2016-02-11 Thread Sebastian Piu
I'm using the Kafka direct stream api but I can have a look on extending it
to have this behaviour

Thanks!
On 11 Feb 2016 9:07 p.m., "Shixiong(Ryan) Zhu" 
wrote:

> Are you using a custom input dstream? If so, you can make the `compute`
> method return None to skip a batch.
>
> On Thu, Feb 11, 2016 at 1:03 PM, Sebastian Piu 
> wrote:
>
>> I was wondering if there is there any way to skip batches with zero
>> events when streaming?
>> By skip I mean avoid the empty rdd from being created at all?
>>
>
>


newbie unable to write to S3 403 forbidden error

2016-02-11 Thread Andy Davidson
I am using spark 1.6.0 in a cluster created using the spark-ec2 script. I am
using the standalone cluster manager

My java streaming app is not able to write to s3. It appears to be some for
of permission problem.

Any idea what the problem might be?

I tried use the IAM simulator to test the policy. Everything seems okay. Any
idea how I can debug this problem?

Thanks in advance

Andy

JavaSparkContext jsc = new JavaSparkContext(conf);


// I did not include the full key in my email
   // the keys do not contain '\'
   // these are the keys used to create the cluster. They belong to the
IAM user andy
jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", "AKIAJREX");

jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey",
"uBh9v1hdUctI23uvq9qR");




  private static void saveTweets(JavaDStream jsonTweets, String
outputURI) {

jsonTweets.foreachRDD(new VoidFunction2() {

private static final long serialVersionUID = 1L;



@Override

public void call(JavaRDD rdd, Time time) throws
Exception {

if(!rdd.isEmpty()) {

// bucket name is 'com.pws.twitter'; it has a folder 'json'

String dirPath =
"s3n://s3-us-west-1.amazonaws.com/com.pws.twitter/json" + "-" +
time.milliseconds();

rdd.saveAsTextFile(dirPath);

}  

}

});




Bucket name : com.pws.titter
Bucket policy (I replaced the account id)

{
"Version": "2012-10-17",
"Id": "Policy1455148808376",
"Statement": [
{
"Sid": "Stmt1455148797805",
"Effect": "Allow",
"Principal": {
"AWS": "arn:aws:iam::123456789012:user/andy"
},
"Action": "s3:*",
"Resource": "arn:aws:s3:::com.pws.twitter/*"
}
]
}






Testing email please ignore

2016-02-11 Thread Mich Talebzadeh
 

-- 

Dr Mich Talebzadeh

LinkedIn
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com


 

Re: cache DataFrame

2016-02-11 Thread Gaurav Agarwal
Thanks for the info below.
I have one more question. I have my own framework where the SQL query is
already built, so I am thinking that instead of using the data frame filter
criteria I could use:
DataFrame d = sqlContext.sql(query);   // query string already built by my framework
d.printSchema();
List<Row> rows = d.collectAsList();

Here, when I call d.collectAsList(), will it go to the database and execute the
query there?

Nothing will be cached there. Please confirm my understanding.

Thanks



On Feb 12, 2016 12:01 AM, "Rishabh Wadhawan"  wrote:

> Hi Gaurav Spark will not load the tables into memory at both the points as
> DataFrames are just abstractions of something that might happen in future
> when you actually throw an (ACTION) like say df.collectAsList or df.show.
> When you run DataFrame df =  sContext.load("jdbc","(select * from employee)
> as employee); all spark does is that it just generates a queryExecution
> plan. That plan gets executed when you throw and ACTION statement.
> Take this example.
>
> DataFrame df = sContext.load("jdbc","(select * from employee) as
> employee); // Spark makes a query execution tree
> df.filter(df.col(“wmpid”).equalTo(“!”));  // Spark adds this to query
> execution tree
> System.out.println(df.queryExecution()) // Print out the query execution
> plan, with physical and logical plans.
>
> df.show(); /*This is when spark starts loading data into memory and
> executes the optimized execution plan, according to the query execution
> tree. This is the point when data gets* materialized
>   */
>
> > On Feb 11, 2016, at 11:20 AM, Gaurav Agarwal 
> wrote:
> >
> > Hi
> >
> > When the dataFrame will load the table into memory when it reads from
> HIVe/Phoenix or from any database.
> > These are two points where need one info , when tables will be loaded
> into memory or cached when at  point 1 or point 2 below.
> >
> >  1. DataFrame df = sContext.load("jdbc","(select * from employee) as
> employee);
> >
> > 2.sContext.sql("select * from employee where wmpid="!");
> >
> >
> >
> >
> >
> >
>
>


Re: Skip empty batches - spark streaming

2016-02-11 Thread Shixiong(Ryan) Zhu
Are you using a custom input dstream? If so, you can make the `compute`
method return None to skip a batch.
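
For a custom input DStream, that looks roughly like this (a hedged Scala sketch;
fetchPending() is a hypothetical poll of whatever source backs the stream):

import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{StreamingContext, Time}
import org.apache.spark.streaming.dstream.InputDStream

class PollingInputDStream(streamingContext: StreamingContext)
    extends InputDStream[String](streamingContext) {

  override def start(): Unit = {}
  override def stop(): Unit = {}

  override def compute(validTime: Time): Option[RDD[String]] = {
    val records = fetchPending()   // hypothetical: poll the source for new data
    if (records.isEmpty) None      // no RDD, so no batch job is generated
    else Some(streamingContext.sparkContext.parallelize(records))
  }

  private def fetchPending(): Seq[String] = Seq.empty   // placeholder
}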

On Thu, Feb 11, 2016 at 1:03 PM, Sebastian Piu 
wrote:

> I was wondering if there is there any way to skip batches with zero events
> when streaming?
> By skip I mean avoid the empty rdd from being created at all?
>


Re: Computing hamming distance over large data set

2016-02-11 Thread Brian Morton
Karl,

This is tremendously useful.  Thanks very much for your insight.

Brian

On Thu, Feb 11, 2016 at 12:58 PM, Karl Higley  wrote:

> Hi,
>
> It sounds like you're trying to solve the approximate nearest neighbor
> (ANN) problem. With a large dataset, parallelizing a brute force O(n^2)
> approach isn't likely to help all that much, because the number of pairwise
> comparisons grows quickly as the size of the dataset increases. I'd look at
> ways to avoid computing the similarity between all pairs, like
> locality-sensitive hashing. (Unfortunately Spark doesn't yet support LSH --
> it's currently slated for the Spark 2.0.0 release, but AFAIK development on
> it hasn't started yet.)
>
> There are a bunch of Python libraries that support various approaches to
> the ANN problem (including LSH), though. It sounds like you need fast
> lookups, so you might check out https://github.com/spotify/annoy. For
> other alternatives, see this performance comparison of Python ANN libraries
> : https://github.com/erikbern/ann-benchmarks.
>
> Hope that helps,
> Karl
>
> On Wed, Feb 10, 2016 at 10:29 PM rokclimb15  wrote:
>
>> Hi everyone, new to this list and Spark, so I'm hoping someone can point
>> me
>> in the right direction.
>>
>> I'm trying to perform this same sort of task:
>>
>> http://stackoverflow.com/questions/14925151/hamming-distance-optimization-for-mysql-or-postgresql
>>
>> and I'm running into the same problem - it doesn't scale.  Even on a very
>> fast processor, MySQL pegs out one CPU core at 100% and takes 8 hours to
>> find a match with 30 million+ rows.
>>
>> What I would like to do is to load this data set from MySQL into Spark and
>> compute the Hamming distance using all available cores, then select the
>> rows
>> matching a maximum distance.  I'm most familiar with Python, so would
>> prefer
>> to use that.
>>
>> I found an example of loading data from MySQL
>>
>>
>> http://blog.predikto.com/2015/04/10/using-the-spark-datasource-api-to-access-a-database/
>>
>> I found a related DataFrame commit and docs, but I'm not exactly sure how
>> to
>> put this all together.
>>
>>
>> https://mail-archives.apache.org/mod_mbox/spark-commits/201505.mbox/%3c707d439f5fcb478b99aa411e23abb...@git.apache.org%3E
>>
>>
>> http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.Column.bitwiseXOR
>>
>> Could anyone please point me to a similar example I could follow as a
>> Spark
>> newb to try this out?  Is this even worth attempting, or will it similarly
>> fail performance-wise?
>>
>> Thanks!
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Computing-hamming-distance-over-large-data-set-tp26202.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>


Re: Skip empty batches - spark streaming

2016-02-11 Thread Shixiong(Ryan) Zhu
Yeah, DirectKafkaInputDStream always returns a RDD even if it's empty. Feel
free to send a PR to improve it.

On Thu, Feb 11, 2016 at 1:09 PM, Sebastian Piu 
wrote:

> I'm using the Kafka direct stream api but I can have a look on extending
> it to have this behaviour
>
> Thanks!
> On 11 Feb 2016 9:07 p.m., "Shixiong(Ryan) Zhu" 
> wrote:
>
>> Are you using a custom input dstream? If so, you can make the `compute`
>> method return None to skip a batch.
>>
>> On Thu, Feb 11, 2016 at 1:03 PM, Sebastian Piu 
>> wrote:
>>
>>> I was wondering if there is there any way to skip batches with zero
>>> events when streaming?
>>> By skip I mean avoid the empty rdd from being created at all?
>>>
>>
>>


Re: Skip empty batches - spark streaming

2016-02-11 Thread Sebastian Piu
Yes, and as far as I recall it also has partitions (empty) which screws up
the isEmpty call if the rdd has been transformed down the line. I will have
a look tomorrow at the office and see if I can collaborate
On 11 Feb 2016 9:14 p.m., "Shixiong(Ryan) Zhu" 
wrote:

> Yeah, DirectKafkaInputDStream always returns a RDD even if it's empty.
> Feel free to send a PR to improve it.
>
> On Thu, Feb 11, 2016 at 1:09 PM, Sebastian Piu 
> wrote:
>
>> I'm using the Kafka direct stream api but I can have a look on extending
>> it to have this behaviour
>>
>> Thanks!
>> On 11 Feb 2016 9:07 p.m., "Shixiong(Ryan) Zhu" 
>> wrote:
>>
>>> Are you using a custom input dstream? If so, you can make the `compute`
>>> method return None to skip a batch.
>>>
>>> On Thu, Feb 11, 2016 at 1:03 PM, Sebastian Piu 
>>> wrote:
>>>
 I was wondering if there is there any way to skip batches with zero
 events when streaming?
 By skip I mean avoid the empty rdd from being created at all?

>>>
>>>
>


Re: Stateful Operation on JavaPairDStream Help Needed !!

2016-02-11 Thread Sebastian Piu
Looks like mapWithState could help you?
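
A hedged Scala sketch of that idea on Spark 1.6 (the thread's code is Java,
where StateSpec is used the same way; MyClass, its fields and the 2-hour timeout
are taken from the question, and the emission logic is only illustrative):

import org.apache.spark.streaming.{Minutes, State, StateSpec}

case class MyClass(a: Int, b: String)

val spec = StateSpec.function(
  (key: String, value: Option[MyClass], state: State[MyClass]) => {
    val out: Option[(String, MyClass)] =
      if (state.exists() && !state.isTimingOut()) {
        // second occurrence of the key: combine with the stored first value
        val prev = state.get()
        state.remove()
        value.map(v => (key, MyClass(prev.a, v.b)))
      } else {
        // first occurrence (or timeout): remember the value, emit nothing yet
        value.foreach(state.update)
        None
      }
    out
  }).timeout(Minutes(120))   // drop keys whose pair never arrives within 2 hours

// pairStream.mapWithState(spec).flatMap(_.toSeq).foreachRDD(...)   // write to the DB here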
On 11 Feb 2016 8:40 p.m., "Abhishek Anand"  wrote:

> Hi All,
>
> I have an use case like follows in my production environment where I am
> listening from kafka with slideInterval of 1 min and windowLength of 2
> hours.
>
> I have a JavaPairDStream where for each key I am getting the same key but
> with different value,which might appear in the same batch or some next
> batch.
>
> When the key appears second time I need to update a field in value of
> previous key with a field in the later key. The keys for which the
> combination keys do not come should be rejected after 2 hours.
>
> At the end of each second I need to output the result to external database.
>
> For example :
>
> Suppose valueX is object of MyClass with fields int a, String b
> At t=1sec I am getting
> key0,value0(0,"prev0")
> key1,value1 (1, "prev1")
> key2,value2 (2,"prev2")
> key2,value3 (3, "next2")
>
> Output to database after 1 sec
> key2, newValue (2,"next2")
>
> At t=2 sec getting
> key3,value4(4,"prev3")
> key1,value5(5,"next1")
>
> Output to database after 2 sec
> key1,newValue(1,"next1")
>
> At t=3 sec
> key4,value6(6,"prev4")
> key3,value7(7,"next3")
> key5,value5(8,"prev5")
> key5,value5(9,"next5")
> key0,value0(10,"next0")
>
> Output to database after 3 sec
> key0,newValue(0,"next0")
> key3,newValue(4,"next3")
> key5,newValue(8,"next5")
>
>
> Please suggest how this can be achieved.
>
>
> Thanks a lot 
> Abhi
>
>
>


Stateful Operation on JavaPairDStream Help Needed !!

2016-02-11 Thread Abhishek Anand
Hi All,

I have an use case like follows in my production environment where I am
listening from kafka with slideInterval of 1 min and windowLength of 2
hours.

I have a JavaPairDStream where for each key I am getting the same key but
with different value,which might appear in the same batch or some next
batch.

When the key appears second time I need to update a field in value of
previous key with a field in the later key. The keys for which the
combination keys do not come should be rejected after 2 hours.

At the end of each second I need to output the result to external database.

For example :

Suppose valueX is object of MyClass with fields int a, String b
At t=1sec I am getting
key0,value0(0,"prev0")
key1,value1 (1, "prev1")
key2,value2 (2,"prev2")
key2,value3 (3, "next2")

Output to database after 1 sec
key2, newValue (2,"next2")

At t=2 sec getting
key3,value4(4,"prev3")
key1,value5(5,"next1")

Output to database after 2 sec
key1,newValue(1,"next1")

At t=3 sec
key4,value6(6,"prev4")
key3,value7(7,"next3")
key5,value5(8,"prev5")
key5,value5(9,"next5")
key0,value0(10,"next0")

Output to database after 3 sec
key0,newValue(0,"next0")
key3,newValue(4,"next3")
key5,newValue(8,"next5")


Please suggest how this can be achieved.


Thanks a lot 
Abhi


Skip empty batches - spark streaming

2016-02-11 Thread Sebastian Piu
I was wondering if there is there any way to skip batches with zero events
when streaming?
By skip I mean avoid the empty rdd from being created at all?


Re: Skip empty batches - spark streaming

2016-02-11 Thread Cody Koeninger
Please don't change the behavior of DirectKafkaInputDStream.
Returning an empty rdd is (imho) the semantically correct thing to do, and
some existing jobs depend on that behavior.

If it's really an issue for you, you can either override
directkafkainputdstream, or just check isEmpty as the first thing you do
with the rdd (before any transformations)

In any recent version of spark, isEmpty on a KafkaRDD is a driver-side only
operation that is basically free.
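
As a minimal Scala sketch of that guard (the same check works from Java's
foreachRDD; `stream` is assumed to be the (key, value) DStream returned by
createDirectStream, and the output path is a placeholder):

stream.foreachRDD { (rdd, time) =>
  if (!rdd.isEmpty()) {        // driver-side check on the untransformed KafkaRDD
    rdd.map(_._2)              // transform only the batches that actually have data
       .saveAsTextFile(s"/output/batch-${time.milliseconds}")
  }
}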


On Thu, Feb 11, 2016 at 3:19 PM, Sebastian Piu 
wrote:

> Yes, and as far as I recall it also has partitions (empty) which screws up
> the isEmpty call if the rdd has been transformed down the line. I will have
> a look tomorrow at the office and see if I can collaborate
> On 11 Feb 2016 9:14 p.m., "Shixiong(Ryan) Zhu" 
> wrote:
>
>> Yeah, DirectKafkaInputDStream always returns a RDD even if it's empty.
>> Feel free to send a PR to improve it.
>>
>> On Thu, Feb 11, 2016 at 1:09 PM, Sebastian Piu 
>> wrote:
>>
>>> I'm using the Kafka direct stream api but I can have a look on extending
>>> it to have this behaviour
>>>
>>> Thanks!
>>> On 11 Feb 2016 9:07 p.m., "Shixiong(Ryan) Zhu" 
>>> wrote:
>>>
 Are you using a custom input dstream? If so, you can make the `compute`
 method return None to skip a batch.

 On Thu, Feb 11, 2016 at 1:03 PM, Sebastian Piu  wrote:

> I was wondering if there is there any way to skip batches with zero
> events when streaming?
> By skip I mean avoid the empty rdd from being created at all?
>


>>


Re: Skip empty batches - spark streaming

2016-02-11 Thread Andy Davidson

You can always call rdd.isEmpty()

Andy

private static void save(JavaDStream jsonRdd, String outputURI)
{

jsonTweets.foreachRDD(new VoidFunction2() {

private static final long serialVersionUID = 1L;



@Override

public void call(JavaRDD rdd, Time time) throws
Exception {

if(!rdd.isEmpty()) {

String dirPath = outputURI + "-" + time.milliseconds();

rdd.saveAsTextFile(dirPath);

}  

}

});


From:  Sebastian Piu 
Reply-To:  
Date:  Thursday, February 11, 2016 at 1:19 PM
To:  "Shixiong (Ryan) Zhu" 
Cc:  Sebastian Piu , "user @spark"

Subject:  Re: Skip empty batches - spark streaming

> 
> Yes, and as far as I recall it also has partitions (empty) which screws up the
> isEmpty call if the rdd has been transformed down the line. I will have a look
> tomorrow at the office and see if I can collaborate
> 
> On 11 Feb 2016 9:14 p.m., "Shixiong(Ryan) Zhu" 
> wrote:
>> Yeah, DirectKafkaInputDStream always returns a RDD even if it's empty. Feel
>> free to send a PR to improve it.
>> 
>> On Thu, Feb 11, 2016 at 1:09 PM, Sebastian Piu 
>> wrote:
>>> 
>>> I'm using the Kafka direct stream api but I can have a look on extending it
>>> to have this behaviour
>>> 
>>> Thanks!
>>> 
>>> On 11 Feb 2016 9:07 p.m., "Shixiong(Ryan) Zhu" 
>>> wrote:
 Are you using a custom input dstream? If so, you can make the `compute`
 method return None to skip a batch.
 
 On Thu, Feb 11, 2016 at 1:03 PM, Sebastian Piu 
 wrote:
> 
> I was wondering if there is there any way to skip batches with zero events
> when streaming?
> By skip I mean avoid the empty rdd from being created at all?
 
>> 




off-heap certain operations

2016-02-11 Thread Ovidiu-Cristian MARCU
Hi,

Reading though the latest documentation for Memory management I can see that 
the parameter spark.memory.offHeap.enabled (true by default) is described with 
‘If true, Spark will attempt to use off-heap memory for certain operations’ [1].

Can you please describe the certain operations you are referring to?  

http://spark.apache.org/docs/latest/configuration.html#memory-management 


Thank!

Best,
Ovidiu

Re: Skip empty batches - spark streaming

2016-02-11 Thread Sebastian Piu
Thanks for clarifying Cody. I will extend the current behaviour for my use
case. If there is anything worth sharing I'll run it through the list

Cheers
On 11 Feb 2016 9:47 p.m., "Cody Koeninger"  wrote:

> Please don't change the behavior of DirectKafkaInputDStream.
> Returning an empty rdd is (imho) the semantically correct thing to do, and
> some existing jobs depend on that behavior.
>
> If it's really an issue for you, you can either override
> directkafkainputdstream, or just check isEmpty as the first thing you do
> with the rdd (before any transformations)
>
> In any recent version of spark, isEmpty on a KafkaRDD is a driver-side
> only operation that is basically free.
>
>
> On Thu, Feb 11, 2016 at 3:19 PM, Sebastian Piu 
> wrote:
>
>> Yes, and as far as I recall it also has partitions (empty) which screws
>> up the isEmpty call if the rdd has been transformed down the line. I will
>> have a look tomorrow at the office and see if I can collaborate
>> On 11 Feb 2016 9:14 p.m., "Shixiong(Ryan) Zhu" 
>> wrote:
>>
>>> Yeah, DirectKafkaInputDStream always returns a RDD even if it's empty.
>>> Feel free to send a PR to improve it.
>>>
>>> On Thu, Feb 11, 2016 at 1:09 PM, Sebastian Piu 
>>> wrote:
>>>
 I'm using the Kafka direct stream api but I can have a look on
 extending it to have this behaviour

 Thanks!
 On 11 Feb 2016 9:07 p.m., "Shixiong(Ryan) Zhu" 
 wrote:

> Are you using a custom input dstream? If so, you can make the
> `compute` method return None to skip a batch.
>
> On Thu, Feb 11, 2016 at 1:03 PM, Sebastian Piu <
> sebastian@gmail.com> wrote:
>
>> I was wondering if there is there any way to skip batches with zero
>> events when streaming?
>> By skip I mean avoid the empty rdd from being created at all?
>>
>
>
>>>
>


Re: Building Spark with a Custom Version of Hadoop: HDFS ClassNotFoundException

2016-02-11 Thread Ted Yu
I think SPARK_CLASSPATH is deprecated.

Can you show the command line launching your Spark job ?
Which Spark release do you use ?

Thanks



On Thu, Feb 11, 2016 at 5:38 PM, Charlie Wright 
wrote:

> built and installed hadoop with:
> mvn package -Pdist -DskipTests -Dtar
> mvn install -DskipTests
>
> built spark with:
> mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.8.0-SNAPSHOT -DskipTests clean
> package
>
> Where would I check the classpath? Is it the environment variable
> SPARK_CLASSPATH?
>
> Charles
>
> --
> Date: Thu, 11 Feb 2016 17:29:00 -0800
> Subject: Re: Building Spark with a Custom Version of Hadoop: HDFS
> ClassNotFoundException
> From: yuzhih...@gmail.com
> To: charliewri...@live.ca
> CC: d...@spark.apache.org
>
> Hdfs class is in hadoop-hdfs-XX.jar
>
> Can you check the classpath to see if the above jar is there ?
>
> Please describe the command lines you used for building hadoop / Spark.
>
> Cheers
>
> On Thu, Feb 11, 2016 at 5:15 PM, Charlie Wright 
> wrote:
>
> I am having issues trying to run a test job on a built version of Spark
> with a custom Hadoop JAR.
> My custom hadoop version runs without issues and I can run jobs from a
> precompiled version of Spark (with Hadoop) no problem.
>
> However, whenever I try to run the same Spark example on the Spark version
> with my custom hadoop JAR - I get this error:
> "Exception in thread "main" java.lang.RuntimeException:
> java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.Hdfs not found"
>
> Does anybody know why this is happening?
>
> Thanks,
> Charles.
>
>
>


Re: Building Spark with a Custom Version of Hadoop: HDFS ClassNotFoundException

2016-02-11 Thread Ted Yu
The Spark driver does not run on the YARN cluster in client mode, only the
Spark executors do.
Can you check YARN logs for the failed job to see if there was more clue ?

Does the YARN cluster run the customized hadoop or stock hadoop ?

Cheers

On Thu, Feb 11, 2016 at 5:44 PM, Charlie Wright 
wrote:

> ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master
> yarn --deploy-mode client --driver-memory 4g --executor-memory
> 1664m --executor-cores 1 --queue default
> examples/target/spark-examples*.jar 10
>
> I am using the 1.6.0 release.
>
>
> Charles.
>
> --
> Date: Thu, 11 Feb 2016 17:41:54 -0800
> Subject: Re: Building Spark with a Custom Version of Hadoop: HDFS
> ClassNotFoundException
> From: yuzhih...@gmail.com
> To: charliewri...@live.ca; user@spark.apache.org
>
>
> I think SPARK_CLASSPATH is deprecated.
>
> Can you show the command line launching your Spark job ?
> Which Spark release do you use ?
>
> Thanks
>
>
>
> On Thu, Feb 11, 2016 at 5:38 PM, Charlie Wright 
> wrote:
>
> built and installed hadoop with:
> mvn package -Pdist -DskipTests -Dtar
> mvn install -DskipTests
>
> built spark with:
> mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.8.0-SNAPSHOT -DskipTests clean
> package
>
> Where would I check the classpath? Is it the environment variable
> SPARK_CLASSPATH?
>
> Charles
>
> --
> Date: Thu, 11 Feb 2016 17:29:00 -0800
> Subject: Re: Building Spark with a Custom Version of Hadoop: HDFS
> ClassNotFoundException
> From: yuzhih...@gmail.com
> To: charliewri...@live.ca
> CC: d...@spark.apache.org
>
> Hdfs class is in hadoop-hdfs-XX.jar
>
> Can you check the classpath to see if the above jar is there ?
>
> Please describe the command lines you used for building hadoop / Spark.
>
> Cheers
>
> On Thu, Feb 11, 2016 at 5:15 PM, Charlie Wright 
> wrote:
>
> I am having issues trying to run a test job on a built version of Spark
> with a custom Hadoop JAR.
> My custom hadoop version runs without issues and I can run jobs from a
> precompiled version of Spark (with Hadoop) no problem.
>
> However, whenever I try to run the same Spark example on the Spark version
> with my custom hadoop JAR - I get this error:
> "Exception in thread "main" java.lang.RuntimeException:
> java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.Hdfs not found"
>
> Does anybody know why this is happening?
>
> Thanks,
> Charles.
>
>
>
>


Re: Spark Streaming with Kafka: Dealing with 'slow' partitions

2016-02-11 Thread p pathiyil
With this setting, I can see that the next job is being executed before the
previous one is finished. However, the processing of the 'hot' partition
eventually hogs all the concurrent jobs. If there was a way to restrict
jobs to be one per partition, then this setting would provide the
per-partition isolation.

Is there anything in the framework which would give control over that
aspect ?

Thanks.


On Thu, Feb 11, 2016 at 9:55 PM, Cody Koeninger  wrote:

> spark.streaming.concurrentJobs
>
>
> see e.g. 
> http://stackoverflow.com/questions/23528006/how-jobs-are-assigned-to-executors-in-spark-streaming
>
>
> On Thu, Feb 11, 2016 at 9:33 AM, p pathiyil  wrote:
>
>> Thanks for the response Cody.
>>
>> The producers are out of my control, so can't really balance the incoming
>> content across the various topics and partitions. The number of topics and
>> partitions are quite large and the volume across then not very well known
>> ahead of time. So it is quite hard to segregate low and high volume topics
>> in to separate driver programs.
>>
>> Will look at shuffle / repartition.
>>
>> Could you share the setting for starting another batch in parallel ? It
>> might be ok to call the 'save' of the processed messages out of order if
>> that is the only consequence of this setting.
>>
>> When separate DStreams are created per partition (and if union() is not
>> called on them), what aspect of the framework still ties the scheduling of
>> jobs across the partitions together ? Asking this to see if creating
>> multiple threads in the driver and calling createDirectStream per partition
>> in those threads can provide isolation.
>>
>>
>>
>> On Thu, Feb 11, 2016 at 8:14 PM, Cody Koeninger 
>> wrote:
>>
>>> The real way to fix this is by changing partitioning, so you don't have
>>> a hot partition.  It would be better to do this at the time you're
>>> producing messages, but you can also do it with a shuffle / repartition
>>> during consuming.
>>>
>>> There is a setting to allow another batch to start in parallel, but
>>> that's likely to have unintended consequences.
>>>
>>> On Thu, Feb 11, 2016 at 7:59 AM, p pathiyil  wrote:
>>>
 Hi,

 I am looking at a way to isolate the processing of messages from each
 Kafka partition within the same driver.

 Scenario: A DStream is created with the createDirectStream call by
 passing in a few partitions. Let us say that the streaming context is
 defined to have a time duration of 2 seconds. If the processing of messages
 from a single partition takes more than 2 seconds (while all the others
 finish much quicker), it seems that the next set of jobs get scheduled only
 after the processing of that last partition. This means that the delay is
 effective for all partitions and not just the partition that was truly the
 cause of the delay. What I would like to do is to have the delay only
 impact the 'slow' partition.

 Tried to create one DStream per partition and then do a union of all
 partitions, (similar to the sample in
 http://spark.apache.org/docs/latest/streaming-programming-guide.html#reducing-the-batch-processing-times),
 but that didn't seem to help.

 Please suggest the correct approach to solve this issue.

 Thanks,
 Praveen.

>>>
>>>
>>
>
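
For reference, a hedged sketch of where the setting Cody mentions goes; the application name, batch interval and the value 4 are arbitrary assumptions, and the caveat about out-of-order output still applies:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("concurrent-batches")
  // Default is 1; raising it lets jobs from later batches start while a slow
  // batch is still running, rather than giving true per-partition isolation.
  .set("spark.streaming.concurrentJobs", "4")

val ssc = new StreamingContext(conf, Seconds(2))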


SparkSQL parallelism

2016-02-11 Thread Madabhattula Rajesh Kumar
Hi,

I have a Spark cluster with one master and 3 worker nodes. I have written the
below code to fetch records from Oracle using Spark SQL

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val employees = sqlContext.read.format("jdbc").options(
Map("url" -> "jdbc:oracle:thin:@:1525:SID",
"dbtable" -> "(select * from employee where name like '%18%')",
"user" -> "username",
"password" -> "password")).load

I have submitted this job to the Spark cluster using the spark-submit command.



It looks like all 3 workers are executing the same query and fetching the same
data, i.e. it is making 3 JDBC calls to Oracle.
How can I make this code issue a single JDBC call to Oracle (in the case of more
than one worker)?

Please help me to resolve this use case

Regards,
Rajesh
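
For what it's worth, a hedged sketch of the partitioning options the JDBC data source exposes; as far as I understand, each Spark partition opens its own connection and fetches only its slice, and with no partitioning options (or numPartitions set to 1) the whole query runs over a single JDBC connection. The partition column and bounds below are assumptions:

val employees = sqlContext.read.format("jdbc").options(
  Map("url" -> "jdbc:oracle:thin:@:1525:SID",
    "dbtable" -> "(select * from employee where name like '%18%')",
    "user" -> "username",
    "password" -> "password",
    "partitionColumn" -> "employee_id",   // assumed numeric column
    "lowerBound" -> "1",
    "upperBound" -> "100000",
    "numPartitions" -> "1")).load          // 1 => a single JDBC call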


Re: Scala types to StructType

2016-02-11 Thread Yogesh Mahajan
CatatlystTypeConverters.scala has all kinds of utility methods to convert
from Scala types to Row and vice versa.

On Fri, Feb 12, 2016 at 12:21 AM, Rishabh Wadhawan 
wrote:

> I had the same issue. I resolved it in Java, but I am pretty sure it would
> work with Scala too. It's kind of a gross hack, but here is what I did: say I had
> a table in MySQL with 1000 columns.
> What I did is that I threw a JDBC query to extract the schema of the
> table. I stored that schema and wrote a map function to create StructFields
> using StructType and RowFactory. Then I took that table loaded as a
> DataFrame, even though it had a schema. I converted that DataFrame into
> an RDD, which is when it lost the schema, performed something using
> that RDD, and then converted that RDD back with the StructFields.
> If your source is a structured type then it would be better if you can load
> it directly as a DataFrame; that way you can preserve the schema. However, in your
> case you should do something like this:
>
> List<StructField> fields = new ArrayList<>();
> for (String key : map.keySet()) {
>     fields.add(DataTypes.createStructField(key, DataTypes.StringType, true));
> }
>
> StructType schemaOfDataFrame = DataTypes.createStructType(fields);
>
> sqlContext.createDataFrame(rdd, schemaOfDataFrame);
>
> This is how I would do it in Java; not sure about the Scala syntax.
> Please tell me if that helped.
>
> On Feb 11, 2016, at 7:20 AM, Fabian Böhnlein 
> wrote:
>
> Hi all,
>
> is there a way to create a Spark SQL Row schema based on Scala data types
> without creating a manual mapping?
>
> That's the only example I can find which doesn't require
> spark.sql.types.DataType already as input, but it requires defining them
> as Strings.
>
>   val struct = (new StructType)
>     .add("a", "int")
>     .add("b", "long")
>     .add("c", "string")
>
>
>
> Specifically I have an RDD where each element is a Map of 100s of
> variables with different data types which I want to transform to a DataFrame
> where the keys should end up as the column names:
>
> Map ("Amean" -> 20.3, "Asize" -> 12, "Bmean" -> )
>
>
> Is there a different possibility than building a mapping from the values'
> .getClass to the Spark SQL DataTypes?
>
>
> Thanks,
> Fabian
>
>
>
>
>
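
A hedged sketch of one way to build the schema without spelling the types out as strings, assuming rdd is the RDD of Map[String, Any] from the question, every map has the same keys, and sqlContext is in scope; the type mapping is deliberately incomplete:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// Map runtime value classes to Spark SQL types; extend for whatever actually occurs.
def dataTypeOf(v: Any): DataType = v match {
  case _: Int     => IntegerType
  case _: Long    => LongType
  case _: Double  => DoubleType
  case _: Boolean => BooleanType
  case _          => StringType
}

val sample = rdd.first()   // use one element's values to infer the field types
val schema = StructType(sample.toSeq.map { case (name, value) =>
  StructField(name, dataTypeOf(value), nullable = true)
})

// Field values must be laid out in the same order as the schema's fields.
val rowRDD = rdd.map(m => Row.fromSeq(schema.fieldNames.map(m(_)).toSeq))
val df = sqlContext.createDataFrame(rowRDD, schema)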


Re: Scala types to StructType

2016-02-11 Thread Yogesh Mahajan
Right, Thanks Ted.

On Fri, Feb 12, 2016 at 10:21 AM, Ted Yu  wrote:

> Minor correction: the class is CatalystTypeConverters.scala
>
> On Thu, Feb 11, 2016 at 8:46 PM, Yogesh Mahajan 
> wrote:
>
>> CatatlystTypeConverters.scala has all kinds of utility methods to convert
>> from Scala types to Row and vice versa.
>>
>>
>> On Fri, Feb 12, 2016 at 12:21 AM, Rishabh Wadhawan 
>> wrote:
>>
>>> I had the same issue. I resolved it in Java, but I am pretty sure it
>>> would work with Scala too. It's kind of a gross hack, but here is what I did:
>>> say I had a table in MySQL with 1000 columns.
>>> What I did is that I threw a JDBC query to extract the schema of the
>>> table. I stored that schema and wrote a map function to create StructFields
>>> using StructType and RowFactory. Then I took that table loaded as a
>>> DataFrame, even though it had a schema. I converted that DataFrame into
>>> an RDD, which is when it lost the schema, performed something using
>>> that RDD, and then converted that RDD back with the StructFields.
>>> If your source is a structured type then it would be better if you can
>>> load it directly as a DataFrame; that way you can preserve the schema. However, in
>>> your case you should do something like this:
>>>
>>> List<StructField> fields = new ArrayList<>();
>>> for (String key : map.keySet()) {
>>>     fields.add(DataTypes.createStructField(key, DataTypes.StringType, true));
>>> }
>>>
>>> StructType schemaOfDataFrame = DataTypes.createStructType(fields);
>>>
>>> sqlContext.createDataFrame(rdd, schemaOfDataFrame);
>>>
>>> This is how I would do it in Java; not sure about the Scala
>>> syntax. Please tell me if that helped.
>>>
>>> On Feb 11, 2016, at 7:20 AM, Fabian Böhnlein 
>>> wrote:
>>>
>>> Hi all,
>>>
>>> is there a way to create a Spark SQL Row schema based on Scala data
>>> types without creating a manual mapping?
>>>
>>> That's the only example I can find which doesn't require
>>> spark.sql.types.DataType already as input, but it requires defining them
>>> as Strings.
>>>
>>>   val struct = (new StructType)
>>>     .add("a", "int")
>>>     .add("b", "long")
>>>     .add("c", "string")
>>>
>>>
>>>
>>> Specifically I have an RDD where each element is a Map of 100s of
>>> variables with different data types which I want to transform to a DataFrame
>>> where the keys should end up as the column names:
>>>
>>> Map ("Amean" -> 20.3, "Asize" -> 12, "Bmean" -> )
>>>
>>>
>>> Is there a different possibility than building a mapping from the
>>> values' .getClass to the Spark SQL DataTypes?
>>>
>>>
>>> Thanks,
>>> Fabian
>>>
>>>
>>>
>>>
>>>
>>
>


Re: Scala types to StructType

2016-02-11 Thread Ted Yu
Minor correction: the class is CatalystTypeConverters.scala

On Thu, Feb 11, 2016 at 8:46 PM, Yogesh Mahajan 
wrote:

> CatatlystTypeConverters.scala has all kinds of utility methods to convert
> from Scala types to Row and vice versa.
>
>
> On Fri, Feb 12, 2016 at 12:21 AM, Rishabh Wadhawan 
> wrote:
>
>> I had the same issue. I resolved it in Java, but I am pretty sure it
>> would work with Scala too. It's kind of a gross hack, but here is what I did:
>> say I had a table in MySQL with 1000 columns.
>> What I did is that I threw a JDBC query to extract the schema of the
>> table. I stored that schema and wrote a map function to create StructFields
>> using StructType and RowFactory. Then I took that table loaded as a
>> DataFrame, even though it had a schema. I converted that DataFrame into
>> an RDD, which is when it lost the schema, performed something using
>> that RDD, and then converted that RDD back with the StructFields.
>> If your source is a structured type then it would be better if you can load
>> it directly as a DataFrame; that way you can preserve the schema. However, in your
>> case you should do something like this:
>>
>> List<StructField> fields = new ArrayList<>();
>> for (String key : map.keySet()) {
>>     fields.add(DataTypes.createStructField(key, DataTypes.StringType, true));
>> }
>>
>> StructType schemaOfDataFrame = DataTypes.createStructType(fields);
>>
>> sqlContext.createDataFrame(rdd, schemaOfDataFrame);
>>
>> This is how I would do it in Java; not sure about the Scala
>> syntax. Please tell me if that helped.
>>
>> On Feb 11, 2016, at 7:20 AM, Fabian Böhnlein 
>> wrote:
>>
>> Hi all,
>>
>> is there a way to create a Spark SQL Row schema based on Scala data types
>> without creating a manual mapping?
>>
>> That's the only example I can find which doesn't require
>> spark.sql.types.DataType already as input, but it requires defining them
>> as Strings.
>>
>>   val struct = (new StructType)
>>     .add("a", "int")
>>     .add("b", "long")
>>     .add("c", "string")
>>
>>
>>
>> Specifically I have an RDD where each element is a Map of 100s of
>> variables with different data types which I want to transform to a DataFrame
>> where the keys should end up as the column names:
>>
>> Map ("Amean" -> 20.3, "Asize" -> 12, "Bmean" -> )
>>
>>
>> Is there a different possibility than building a mapping from the values'
>> .getClass to the Spark SQL DataTypes?
>>
>>
>> Thanks,
>> Fabian
>>
>>
>>
>>
>>
>


Re: Computing hamming distance over large data set

2016-02-11 Thread Karl Higley
Hi,

It sounds like you're trying to solve the approximate nearest neighbor
(ANN) problem. With a large dataset, parallelizing a brute force O(n^2)
approach isn't likely to help all that much, because the number of pairwise
comparisons grows quickly as the size of the dataset increases. I'd look at
ways to avoid computing the similarity between all pairs, like
locality-sensitive hashing. (Unfortunately Spark doesn't yet support LSH --
it's currently slated for the Spark 2.0.0 release, but AFAIK development on
it hasn't started yet.)

There are a bunch of Python libraries that support various approaches to
the ANN problem (including LSH), though. It sounds like you need fast
lookups, so you might check out https://github.com/spotify/annoy. For other
alternatives, see this performance comparison of Python ANN libraries:
https://github.com/erikbern/ann-benchmarks.

Hope that helps,
Karl

On Wed, Feb 10, 2016 at 10:29 PM rokclimb15  wrote:

> Hi everyone, new to this list and Spark, so I'm hoping someone can point me
> in the right direction.
>
> I'm trying to perform this same sort of task:
>
> http://stackoverflow.com/questions/14925151/hamming-distance-optimization-for-mysql-or-postgresql
>
> and I'm running into the same problem - it doesn't scale.  Even on a very
> fast processor, MySQL pegs out one CPU core at 100% and takes 8 hours to
> find a match with 30 million+ rows.
>
> What I would like to do is to load this data set from MySQL into Spark and
> compute the Hamming distance using all available cores, then select the
> rows
> matching a maximum distance.  I'm most familiar with Python, so would
> prefer
> to use that.
>
> I found an example of loading data from MySQL
>
>
> http://blog.predikto.com/2015/04/10/using-the-spark-datasource-api-to-access-a-database/
>
> I found a related DataFrame commit and docs, but I'm not exactly sure how
> to
> put this all together.
>
>
> https://mail-archives.apache.org/mod_mbox/spark-commits/201505.mbox/%3c707d439f5fcb478b99aa411e23abb...@git.apache.org%3E
>
>
> http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.Column.bitwiseXOR
>
> Could anyone please point me to a similar example I could follow as a Spark
> newb to try this out?  Is this even worth attempting, or will it similarly
> fail performance-wise?
>
> Thanks!
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Computing-hamming-distance-over-large-data-set-tp26202.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>
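
If the single-probe variant from the linked question is still worth trying, here is a hedged Scala sketch of the bitwiseXOR approach referenced above; the JDBC URL, table and column names, probe value and partitioning bounds are all assumptions, sqlContext is assumed to be in scope, and Karl's caveat about all-pairs comparisons still stands:

import org.apache.spark.sql.functions._

val hashes = sqlContext.read.format("jdbc").options(Map(
  "url"             -> "jdbc:mysql://dbhost/mydb",
  "dbtable"         -> "hashes",
  "user"            -> "user",
  "password"        -> "password",
  "partitionColumn" -> "id",
  "lowerBound"      -> "1",
  "upperBound"      -> "30000000",
  "numPartitions"   -> "64")).load()

// Hamming distance between a 64-bit probe and the stored hash column:
// XOR the bits, then count the ones that remain.
val popcount = udf((x: Long) => java.lang.Long.bitCount(x))
val probe = 0x3cab9977150ebd1fL
val maxDistance = 4

val matches = hashes
  .withColumn("hamming", popcount(col("phash").bitwiseXOR(lit(probe))))
  .filter(col("hamming") <= maxDistance)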


spark thrift server transport protocol

2016-02-11 Thread Sanjeev Verma
I am running the Spark thrift server to run queries over Hive. How can I know
which transport protocol my client and server are currently using?

Thanks


Re: Dataframes

2016-02-11 Thread Rishabh Wadhawan
Hi Gaurav
I am not sure what you are trying to do here, as you are naming two data frames
with the same name, which would be a compilation error in Java.
However, if I understand what you are asking, you can do something like this:


SQLContext sqlContext = new SQLContext(sc); // creates a SQLContext to leverage the Spark SQL API

Properties prop = new java.util.Properties();
prop.setProperty("user", "user");
prop.setProperty("password", "password");

// Here I am throwing a subquery to select all the employees. I could also have a
// subquery like "(select * from employee as e where e.empid = 1) as emp". The
// subquery that you pass as the dbtable parameter runs inside the database that
// you are connecting to using the specified URL.
DataFrame firstDataFrame =
    sqlContext.read().jdbc("url", "(select * from employee) as emp", prop);

DataFrame secondDataFrame =
    sqlContext.read().jdbc("url", "(select * from department as d where d.deptid = 2) as dept", prop);

firstDataFrame.show();
secondDataFrame.show();

Please tell me if this answers your query; if it didn't, please clarify your
question. Please ask me if you have any other questions too.
Thanks
Regards
Rishabh Wadhawan



> On Feb 11, 2016, at 9:47 AM, Gaurav Agarwal  wrote:
> 
> SqlContext sContext = new SQlContext(sc)
> DataFrame df = sContext.load("jdbc","select * from employee"); // These 
> queries will be the Map with driver.
> DataFrame df = sContext.load("jdbc","select * from Dept");



Re: Spark execuotr Memory profiling

2016-02-11 Thread Rishabh Wadhawan
Hi All
Please check this JIRA ticket regarding the issue. I was having the same issue
with shuffling. It seems the maximum size of a single shuffle block is 2 GB:
https://issues.apache.org/jira/browse/SPARK-5928

> On Feb 11, 2016, at 9:08 AM, arun.bong...@cognizant.com wrote:
> 
> Hi All,
> 
> Even I have the same issue.
> 
> The EMR conf is a 3-node cluster with m3.2xlarge.
> 
> I'm trying to read a 100 GB file in spark-sql.
> 
> I have set the below on Spark:
> 
> export SPARK_EXECUTOR_MEMORY=4G
> export SPARK_DRIVER_MEMORY=12G
> 
> export SPARK_EXECUTOR_INSTANCES=16
> export SPARK_EXECUTOR_CORES=16
> 
> spark.kryoserializer.buffer.max 2000m
> spark.driver.maxResultSize 0
> 
>  -XX:MaxPermSize=1024M
> 
> 
> PFB the error:
> 
> 16/02/11 15:32:00 WARN DFSClient: DFSOutputStream ResponseProcessor exception 
>  for block BP-1257713490-xx.xx.xx.xx-1455121562682:blk_1073742405_10984
> java.io.EOFException: Premature EOF: no length prefix available
> at 
> org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:2280)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:244)
> at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:745)
> 
> Kindly help me understand the conf.
> 
> 
> Thanks in advance.
> 
> Regards
> Arun.
> 
> From: Kuchekar [kuchekar.nil...@gmail.com]
> Sent: 11 February 2016 09:42
> To: Nirav Patel
> Cc: spark users
> Subject: Re: Spark execuotr Memory profiling
> 
> Hi Nirav,
> 
>   I faced a similar issue with YARN, EMR 1.5.2, and the following
> Spark conf helped me. You can set the values accordingly:
> conf= 
> (SparkConf().set("spark.master","yarn-client").setAppName("HalfWay").set("spark.driver.memory",
>  "15G").set("spark.yarn.am.memory","15G"))
> 
> conf=conf.set("spark.driver.maxResultSize","10G").set("spark.storage.memoryFraction","0.6").set("spark.shuffle.memoryFraction","0.6").set("spark.yarn.executor.memoryOverhead","4000")
> 
> conf = conf.set("spark.executor.cores","4").set("spark.executor.memory", 
> "15G").set("spark.executor.instances","6")
> 
> 
> Is it also possible to use reduceByKey in place of groupByKey? That might help the
> shuffling too.
> 
> 
> Kuchekar, Nilesh
> 
> On Wed, Feb 10, 2016 at 8:09 PM, Nirav Patel  >
>  wrote:
> We have been trying to solve a memory issue with a Spark job that processes 
> 150GB of data (on disk). It does a groupBy operation; some of the executors 
> will receive somewhere around 2-4M Scala case objects to work with. We are 
> using the following Spark config:
> 
> "executorInstances": "15",
> 
>  "executorCores": "1", (we reduce it to one so single task gets all the 
> executorMemory! at least that's the assumption here)
> 
>  "executorMemory": "15000m",
> 
>  "minPartitions": "2000",
> 
>  "taskCpus": "1", 
> 
>  "executorMemoryOverhead": "1300",
> 
>  "shuffleManager": "tungsten-sort",
> 
> 
>   "storageFraction": "0.4"
> 
> 
> 
> This is a snippet of what we see in spark UI for a Job that fails. 
> 
> This is a stage of this job that fails.
> 
> 
> Stage Id: 5 (retry 15)
> Pool Name: prod
> Description: map at SparkDataJobs.scala:210
> Submitted: 2016/02/09 21:30:06
> Duration: 13 min
> Tasks (Succeeded/Total): 130/389 (16 failed)
> Shuffle Read / Shuffle Write: 1982.6 MB / 818.7 MB
> Failure Reason: org.apache.spark.shuffle.FetchFailedException:
> Error in opening
> FileSegmentManagedBuffer{file=/tmp/hadoop/nm-local-dir/usercache/fasd/appcache/application_1454975800192_0447/blockmgr-abb77b52-9761-457a-b67d-42a15b975d76/0c/shuffle_0_39_0.data,
> offset=11421300, length=2353}
> This is one of the task attempts from the above stage that threw the OOM:
> 
> 
> 2 | 22361 | 0 | FAILED | PROCESS_LOCAL | 38 / nd1.mycom.local |
> 2016/02/09 22:10:42 | 5.2 min | 1.6 min | 7.4 MB / 375509 |
> java.lang.OutOfMemoryError: Java heap space
> java.lang.OutOfMemoryError: Java heap space
>   at java.util.IdentityHashMap.resize(IdentityHashMap.java:469)
>   at java.util.IdentityHashMap.put(IdentityHashMap.java:445)
>   at 
> org.apache.spark.util.SizeEstimator$SearchState.enqueue(SizeEstimator.scala:159)
>   at 
> org.apache.spark.util.SizeEstimator$$anonfun$visitSingleObject$1.apply(SizeEstimator.scala:203)
>   at 
> org.apache.spark.util.SizeEstimator$$anonfun$visitSingleObject$1.apply(SizeEstimator.scala:202)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.util.SizeEstimator$.visitSingleObject(SizeEstimator.scala:202)
>   at 
> org.apache.spark.util.SizeEstimator$.org$apache$spark$util$SizeEstimator$$estimate(SizeEstimator.scala:186)
>   at org.apache.spark.util.SizeEstimator$.estimate(SizeEstimator.scala:54)
>   at 
> 
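
Regarding the reduceByKey suggestion earlier in this thread, a hedged sketch of the shape of that change; Stats, fromRecord and merge are hypothetical placeholders for whatever per-key summary the job actually needs, and sc stands for the SparkContext:

case class Stats(count: Long, sum: Double)
def fromRecord(value: Double): Stats = Stats(1L, value)
def merge(a: Stats, b: Stats): Stats = Stats(a.count + b.count, a.sum + b.sum)

// Stand-in for the job's keyed data.
val pairs = sc.parallelize(Seq(("a", 1.0), ("a", 2.0), ("b", 3.0)))

// groupByKey buffers every value for a key on one executor (can OOM on hot keys):
// val grouped = pairs.groupByKey()

// reduceByKey combines map-side first, so far less data is shuffled and held per key:
val reduced = pairs.mapValues(fromRecord).reduceByKey(merge)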

Re: Kafka + Spark 1.3 Integration

2016-02-11 Thread Nipun Arora
Hi ,

Thanks for the explanation and the example link. Got it working.
A follow-up question: in Kafka one can define properties as follows:

Properties props = new Properties();
props.put("zookeeper.connect", zookeeper);
props.put("group.id", groupId);
props.put("zookeeper.session.timeout.ms", "500");
props.put("zookeeper.sync.time.ms", "250");
props.put("auto.commit.interval.ms", "1000");


How can I do the same for the receiver inside Spark Streaming for Spark v1.3.1?


Thanks

Nipun



On Wed, Feb 10, 2016 at 3:59 PM Cody Koeninger  wrote:

> It's a pair because there's a key and value for each message.
>
> If you just want a single topic, put a single topic in the map of topic ->
> number of partitions.
>
> See
>
>
> https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/streaming/JavaKafkaWordCount.java
>
>
> On Wed, Feb 10, 2016 at 1:28 PM, Nipun Arora 
> wrote:
>
>> Hi,
>>
>> I am trying some basic integration and was going through the manual.
>>
>> I would like to read from a topic, and get a JavaReceiverInputDStream
>>  for messages in that topic. However the example is of
>> JavaPairReceiverInputDStream<>. How do I get a stream for only a single
>> topic in Java?
>>
>> Reference Page:
>> https://spark.apache.org/docs/1.3.0/streaming-kafka-integration.html
>>
>>  import org.apache.spark.streaming.kafka.*;
>>
>>  JavaPairReceiverInputDStream kafkaStream =
>>  KafkaUtils.createStream(streamingContext,
>>  [ZK quorum], [consumer group id], [per-topic number of Kafka partitions 
>> to consume]);
>>
>>
>> Also in the example above what does  signify?
>>
>> Thanks
>> Nipun
>>
>
>
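
A hedged sketch (Scala for brevity; ssc is an existing StreamingContext and the host is a placeholder) of the createStream overload that takes the consumer properties as a kafkaParams map; KafkaUtils also has a corresponding Java overload that accepts a java.util.Map of params:

import kafka.serializer.StringDecoder
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.kafka.KafkaUtils

val kafkaParams = Map(
  "zookeeper.connect"            -> "localhost:2181",
  "group.id"                     -> "my-group",
  "zookeeper.session.timeout.ms" -> "500",
  "zookeeper.sync.time.ms"       -> "250",
  "auto.commit.interval.ms"      -> "1000")

val topics = Map("my-topic" -> 1)   // topic -> number of receiver threads

val stream = KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, topics, StorageLevel.MEMORY_AND_DISK_SER_2)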


Re: spark thrift server transport protocol

2016-02-11 Thread Ted Yu
From the head of HiveThriftServer2:

 * The main entry point for the Spark SQL port of HiveServer2.  Starts up a
`SparkSQLContext` and a
 * `HiveThriftServer2` thrift server.

Looking at HiveServer2.java from Hive, it looks like it uses the Thrift protocol.

FYI

On Thu, Feb 11, 2016 at 9:34 AM, Sanjeev Verma 
wrote:

> I am running the Spark thrift server to run queries over Hive. How can I know
> which transport protocol my client and server are currently using?
>
> Thanks
>


cache DataFrame

2016-02-11 Thread Gaurav Agarwal
Hi

When will the DataFrame load the table into memory when it reads from
Hive/Phoenix or from any other database?
There are two points below where I need this one piece of info: will the table be
loaded into memory or cached at point 1 or at point 2?

1. DataFrame df = sContext.load("jdbc", "(select * from employee) as employee");

2. sContext.sql("select * from employee where wmpid="!");


Re: Dataframes

2016-02-11 Thread Ted Yu
bq. Will sContext (SQLContext) help to query both the dataframes, and will it
decide which dataframe each query runs against?

Can you clarify what you were asking?

The queries would be carried out on the respective DataFrames, as shown in your
snippet.

On Thu, Feb 11, 2016 at 8:47 AM, Gaurav Agarwal 
wrote:

> Thanks
>
> That's what I tried to do, but for these two dataframes there is only one
> sqlContext.
>
> DataFrame tableA = sqlContext.read().jdbc(url,"tableA",prop);
>
> DataFrame tableB = sqlContext.read().jdbc(url,"tableB",prop);
>
>
> When i will say like this
>
>
> SqlContext sContext = new SQlContext(sc)
>
> DataFrame df = sContext.load("jdbc","select * from employee"); // These
> queries will be the Map with driver.
>
> DataFrame df = sContext.load("jdbc","select * from Dept");
>
>
> DataFrame filteredCriteria = sContext.sql("Select * from employee where
> empId="1" ");
>
> DataFrame filteredCriteria2 = sContext.sql("select * from Dept where
> deptid="2" ");
>
>
> List<Row> listEmployee = filteredCriteria.collectAsList();
> 
> List<Row> listDept = filteredCriteria2.collectAsList();
>
>
> Will this work in this scenario? Will sContext (SQLContext) help to query
> both the dataframes, and will it decide which dataframe each query runs
> against?
>
>
> if any more question then let me know.
>
>
> Thanks
>
>
>
>
> On Thu, Feb 11, 2016 at 7:41 PM, Prashant Verma <
> prashant.ve...@ericsson.com> wrote:
>
>> Hi Gaurav,
>>
>> You can try something like this.
>>
>>
>>
>> SparkConf conf = new SparkConf();
>>
>> JavaSparkContext sc = new JavaSparkContext(conf);
>>
>> SQLContext sqlContext = new org.apache.spark.sql.SQLContext(sc);
>>
>> Class.forName("com.mysql.jdbc.Driver");
>>
>> String url="url";
>>
>> Properties prop = new java.util.Properties();
>>
>> prop.setProperty("user","user");
>>
>> prop.setProperty("password","password");
>>
>> DataFrame tableA = sqlContext.read().jdbc(url,"tableA",prop);
>>
>> DataFrame tableB = sqlContext.read().jdbc(url,"tableB",prop);
>>
>>
>>
>> Hope this helps.
>>
>>
>>
>> Thanks,
>>
>> Prashant
>>
>>
>>
>>
>>
>>
>>
>> *From:* Gaurav Agarwal [mailto:gaurav130...@gmail.com]
>> *Sent:* Thursday, February 11, 2016 7:35 PM
>> *To:* user@spark.apache.org
>> *Subject:* Dataframes
>>
>>
>>
>> Hi
>>
>> Can we load 5 data frames for 5 tables in one Spark context?
>> I am asking because we have to give:
>>
>> Map<String, String> options = new HashMap<>();
>>
>> options.put("driver", "");
>> options.put("url", "");
>> options.put("dbtable", "");
>>
>> I can give only one table query at a time in the dbtable options.
>> How will I register multiple queries and dataframes
>> with all the tables?
>>
>> Thanks
>>
>
>
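
A hedged Scala sketch of the piece the snippet above is missing: sqlContext.sql can only refer to a DataFrame by name after it has been registered as a temporary table, and that registration is how a single SQLContext decides which DataFrame a given query runs against (url, prop and sqlContext are assumed to be set up as in Prashant's example):

val employees = sqlContext.read.jdbc(url, "employee", prop)
val depts     = sqlContext.read.jdbc(url, "dept", prop)

// Register each DataFrame under the name the SQL strings will use.
employees.registerTempTable("employee")
depts.registerTempTable("dept")

val filteredEmployees = sqlContext.sql("select * from employee where empId = 1")
val filteredDepts     = sqlContext.sql("select * from dept where deptId = 2")

val employeeRows = filteredEmployees.collectAsList()
val deptRows     = filteredDepts.collectAsList()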


How to parallel read files in a directory

2016-02-11 Thread Junjie Qian
Hi all,
I am working with Spark 1.6 and Scala, and have a big dataset divided into several
small files.
My question is: right now the read operation takes a really long time and often
produces RDD warnings. Is there a way I can read the files in parallel, so that all
nodes or workers read the files at the same time?
Many thanks,
Junjie
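
In case it helps, a hedged sketch: sc.textFile pointed at the whole directory (or a glob) already yields one RDD whose partitions are read in parallel by the executors, one task per file split; the path and the numbers below are assumptions:

val data = sc.textFile("hdfs:///data/mydataset/part-*", minPartitions = 200)

// If the individual files are tiny, coalescing after the read cuts down
// per-partition overhead in later stages.
val compacted = data.coalesce(50)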

ApacheCon NA 2016 - Important Dates!!!

2016-02-11 Thread Melissa Warnkin
 Hello everyone!
I hope this email finds you well.  I hope everyone is as excited about 
ApacheCon as I am!
I'd like to remind you all of a couple of important dates, as well as ask for 
your assistance in spreading the word! Please use your social media platform(s) 
to get the word out! The more visibility, the better ApacheCon will be for 
all!! :)
CFP Close: February 12, 2016
CFP Notifications: February 29, 2016
Schedule Announced: March 3, 2016
To submit a talk, please visit:  
http://events.linuxfoundation.org/events/apache-big-data-north-america/program/cfp

Link to the main site can be found here:  
http://events.linuxfoundation.org/events/apache-big-data-north-america

Apache: Big Data North America 2016 Registration Fees:
Attendee Registration Fee: US$599 through March 6, US$799 through April 10, US$999 thereafter
Committer Registration Fee: US$275 through April 10, US$375 thereafter
Student Registration Fee: US$275 through April 10, $375 thereafter
Planning to attend ApacheCon North America 2016 May 11 - 13, 2016? There is an 
add-on option on the registration form to join the conference for a discounted 
fee of US$399, available only to Apache: Big Data North America attendees.
So, please tweet away!!
I look forward to seeing you in Vancouver! Have a groovy day!!
~Melissa
on behalf of the ApacheCon Team






Re: LogisticRegressionModel not able to load serialized model from S3

2016-02-11 Thread Utkarsh Sengar
The problem turned out to be corrupt Parquet data; the error from Spark was a bit
misleading though.

On Mon, Feb 8, 2016 at 3:41 PM, Utkarsh Sengar 
wrote:

> I am storing a model in s3 in this path:
> "bucket_name/p1/models/lr/20160204_0410PM/ser" and the structure of the
> saved dir looks like this:
>
> 1. bucket_name/p1/models/lr/20160204_0410PM/ser/data -> _SUCCESS,
> _metadata, _common_metadata
> and part-r-0-ebd3dc3c-1f2c-45a3-8793-c8f0cb8e7d01.gz.parquet
> 2. bucket_name/p1/models/lr/20160204_0410PM/ser/metadata/ -> _SUCCESS
> and part-0
>
> So when I try to load "bucket_name/p1/models/lr/20160204_0410PM/ser"
> for LogisticRegressionModel:
>
> LogisticRegressionModel model = LogisticRegressionModel.load(sc.sc(),
> "s3n://bucket_name/p1/models/lr/20160204_0410PM/ser");
>
> I get this error consistently. I have permission to the bucket and I am
> able to other data using textFiles()
>
> Exception in thread "main" org.apache.spark.SparkException: Job aborted
> due to stage failure: Task 0 in stage 2.0 failed 4 times, most recent
> failure: Lost task 0.3 in stage 2.0 (TID 5, mesos-slave12):
> java.lang.NullPointerException
> at
> org.apache.hadoop.fs.s3native.NativeS3FileSystem.getFileStatus(NativeS3FileSystem.java:433)
> at parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:381)
>   at
> parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:155)
> at
> parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:138)
> at
> org.apache.spark.sql.sources.SqlNewHadoopRDD$$anon$1.(SqlNewHadoopRDD.scala:153)
> at
> org.apache.spark.sql.sources.SqlNewHadoopRDD.compute(SqlNewHadoopRDD.scala:124)
> at
> org.apache.spark.sql.sources.SqlNewHadoopRDD.compute(SqlNewHadoopRDD.scala:66)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
> at org.apache.spark.scheduler.Task.run(Task.scala:70)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Driver stacktrace:
> at org.apache.spark.scheduler.DAGScheduler.org
> $apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1273)
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1264)
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1263)
> at
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1263)
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
> at scala.Option.foreach(Option.scala:236)
> at
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:730)
> at
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1457)
> at
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1418)
> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
>
> Any pointers of whats wrong?
>
> --
> -Utkarsh
>



-- 
Thanks,
-Utkarsh


Re: Spark Certification

2016-02-11 Thread Prem Sure
I did recently. It includes MLlib & GraphX too, and I felt like the exam content
covered all topics up to Spark 1.3 and not the post-1.3 versions.


On Thu, Feb 11, 2016 at 9:39 AM, Janardhan Karri 
wrote:

> I am planning to do that with databricks
> http://go.databricks.com/spark-certified-developer
>
> Regards,
> Janardhan
>
> On Thu, Feb 11, 2016 at 2:00 PM, Timothy Spann 
> wrote:
>
>> I was wondering that as well.
>>
>> Also is it fully updated for 1.6?
>>
>> Tim
>> http://airisdata.com/
>> http://sparkdeveloper.com/
>>
>>
>> From: naga sharathrayapati 
>> Date: Wednesday, February 10, 2016 at 11:36 PM
>> To: "user@spark.apache.org" 
>> Subject: Spark Certification
>>
>> Hello All,
>>
>> I am planning on taking the Spark Certification and I was wondering if one
>> has to be well equipped with MLlib & GraphX as well or not?
>>
>> Please advise
>>
>> Thanks
>>
>
>