How is the predict() working in LogisticRegressionModel?

2015-11-13 Thread MEETHU MATHEW
Hi all,

Can somebody point me to the implementation of predict() in
LogisticRegressionModel of Spark MLlib? I could find a predictPoint() in the
class LogisticRegressionModel, but where is predict()?

Thanks & Regards,
Meethu M
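
As far as I can tell, predict() is not defined in LogisticRegressionModel itself;
it is inherited from GeneralizedLinearModel, which delegates each call to the
subclass's predictPoint(). A simplified sketch of that delegation pattern (not
the actual MLlib source) looks roughly like this:

import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// Simplified stand-in for GeneralizedLinearModel: predict() lives here and
// forwards to whatever predictPoint() the concrete model implements.
abstract class SimplifiedGLM(val weights: Vector, val intercept: Double) extends Serializable {

  protected def predictPoint(features: Vector, weights: Vector, intercept: Double): Double

  def predict(testData: Vector): Double =
    predictPoint(testData, weights, intercept)

  // the model is Serializable, so it can be shipped inside the closure
  def predict(testData: RDD[Vector]): RDD[Double] =
    testData.map(v => predictPoint(v, weights, intercept))
}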

Re: how to run unit test for specific component only

2015-11-13 Thread Steve Loughran
try:

mvn test -pl sql  -DwildcardSuites=org.apache.spark.sql -Dtest=none




On 12 Nov 2015, at 03:13, weoccc wrote:

Hi,

I am wondering how to run unit tests for a specific Spark component only.

mvn test -DwildcardSuites="org.apache.spark.sql.*" -Dtest=none

The above command doesn't seem to work. I'm using spark 1.5.

Thanks,

Weide




Re: large, dense matrix multiplication

2015-11-13 Thread Sabarish Sasidharan
We have done this by blocking but without using BlockMatrix. We used our
own blocking mechanism because BlockMatrix didn't exist in Spark 1.2. What
is the size of your block? How much memory are you giving to the executors?
I assume you are running on YARN; if so, you would want to make sure your
yarn executor memory overhead is set to a value higher than the default.
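
For illustration, a hedged sketch of the BlockMatrix route together with the
YARN overhead setting mentioned above; the input path, the 1024x1024 block size
and the 4 GB overhead value are assumptions to adapt (sc is an existing
SparkContext):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix}

// Parse a CSV of dense rows into an IndexedRowMatrix, then block it.
val rows = sc.textFile("hdfs:///data/matrix.csv").zipWithIndex().map {
  case (line, idx) => IndexedRow(idx, Vectors.dense(line.split(',').map(_.toDouble)))
}
val blockMat = new IndexedRowMatrix(rows).toBlockMatrix(1024, 1024).cache()
val gram = blockMat.multiply(blockMat.transpose)   // A * A^T

// And on submission, raise the off-heap headroom YARN allows per executor:
//   spark-submit --conf spark.yarn.executor.memoryOverhead=4096 ...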

Just curious, could you also explain why you need matrix multiplication
with transpose? Smells like similarity computation.

Regards
Sab

On Thu, Nov 12, 2015 at 7:27 PM, Eilidh Troup  wrote:

> Hi,
>
> I’m trying to multiply a large squarish matrix with its transpose.
> Eventually I’d like to work with matrices of size 200,000 by 500,000, but
> I’ve started off first with 100 by 100 which was fine, and then with 10,000
> by 10,000 which failed with an out of memory exception.
>
> I used MLlib and BlockMatrix and tried various block sizes, and also tried
> switching disk serialisation on.
>
> We are running on a small cluster, using a CSV file in HDFS as the input
> data.
>
> Would anyone with experience of multiplying large, dense matrices in spark
> be able to comment on what to try to make this work?
>
> Thanks,
> Eilidh
>
>
> --
> The University of Edinburgh is a charitable body, registered in
> Scotland, with registration number SC005336.
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


-- 

Architect - Big Data
Ph: +91 99805 99458

Manthan Systems | *Company of the year - Analytics (2014 Frost and Sullivan
India ICT)*
+++


Re: HiveServer2 Thrift OOM

2015-11-13 Thread Steve Loughran
looks suspiciously like some thrift transport unmarshalling problem, THRIFT-2660

Spark 1.5 uses hive 1.2.1; it should have the relevant thrift JAR too. 
Otherwise, you could play with thrift JAR versions yourself —maybe it will 
work, maybe not...

On 13 Nov 2015, at 00:29, Yana Kadiyska wrote:

Hi folks, I'm starting a HiveServer2 from a HiveContext
(HiveThriftServer2.startWithContext(hiveContext)) and then connecting to it
via beeline. On the server side, I see the below error, which I think is
related to https://issues.apache.org/jira/browse/HIVE-6468

But I'd like to know:

1. why I see it (I'm using bin/beeline from Spark to connect, not HTTP)
2. should I be dropping any hive-site or hive-default files in conf/ --
HIVE-6468 talks about hive.server2.sasl.message.limit, but I can't see any
documentation on where this setting would go or what a reasonable value is (I'm
trying to do a lightweight deployment and have not needed hive-site.xml so
far...)

Any advice on how to get rid of the below exception would be much appreciated.



Exception in thread "pool-17-thread-2" java.lang.OutOfMemoryError: Java heap 
space
at 
org.apache.thrift.transport.TSaslTransport.receiveSaslMessage(TSaslTransport.java:181)
at 
org.apache.thrift.transport.TSaslServerTransport.handleSaslStartMessage(TSaslServerTransport.java:125)
at 
org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:253)
at 
org.apache.thrift.transport.TSaslServerTransport.open(TSaslServerTransport.java:41)
at 
org.apache.thrift.transport.TSaslServerTransport$Factory.getTransport(TSaslServerTransport.java:216)
at 
org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:189)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)



Joining HDFS and JDBC data sources - benchmarks

2015-11-13 Thread Eran Medan
Hi
I'm looking for some benchmarks on joining data frames where most of the
data is in HDFS (e.g. in Parquet) and some "reference" or "metadata" is
still in an RDBMS. I am only looking at the very first join, before any caching
happens, and I assume there will be a loss of parallelism because JDBCRDD
is probably bottlenecked on the maximum number of parallel connections the
database server can hold.

Are there any measurements / benchmarks that anyone did?
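
For context, a hedged sketch of the setup being described (the URL, table,
column names and partition bounds are made up, and sqlContext is an existing
SQLContext): the Parquet side stays fully parallel, while the JDBC side's
parallelism is capped by numPartitions and by how many connections the database
will accept.

import java.util.Properties

val events = sqlContext.read.parquet("hdfs:///warehouse/events")   // big HDFS side

val props = new Properties()
props.setProperty("user", "reporting")
props.setProperty("password", "secret")

// numPartitions bounds how many parallel connections Spark opens to the database.
val reference = sqlContext.read.jdbc(
  "jdbc:postgresql://db-host:5432/meta", "dim_customer",
  "customer_id", 1L, 1000000L, 8, props)

val joined = events.join(reference, "customer_id")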




a way to allow spark job to continue despite task failures?

2015-11-13 Thread Nicolae Marasoiu
Hi,


I know a task can fail 2 times and only the 3rd breaks the entire job.

I am good with this number of attempts.

I would like that, after trying a task 3 times, Spark continues with the other
tasks.

The job can be marked "failed", but I want all tasks to run.

Please see my use case.


I read a Hadoop input set, and some gzip files are incomplete. I would like to
just skip them, and the only way I see is to tell Spark to tolerate some tasks
failing permanently, if that is possible. With traditional Hadoop MapReduce this
was possible using mapred.max.map.failures.percent.


Do MapReduce params like mapred.max.map.failures.percent apply to Spark/YARN
map-reduce jobs?


I edited $HADOOP_CONF_DIR/mapred-site.xml and added
mapred.max.map.failures.percent=30, but it does not seem to apply; the job still
failed after 3 task attempts.


Should Spark transmit this parameter? Or do the mapred.* settings not apply?

What about other Hadoop parameters (e.g. the ones involved in input reading, as
opposed to "processing" or "application" settings like max.map.failures) - are
those taken into account and forwarded? I saw that Spark should scan
HADOOP_CONF_DIR and forward its contents, but I guess this does not apply to
every parameter, since Spark has its own distribution & DAG-stage processing
logic, which just happens to have a YARN implementation.


Do you know a way to do this in Spark - to tolerate a predefined number of task
failures, but allow the job to continue? This way I could see all the faulty
input files in one job run, delete them all and continue with the rest.
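
I am not aware of a Spark setting in 1.5 that does this; one hedged driver-side
workaround is to probe each gzip file with its own tiny job, so that a corrupt
file only fails that probe, and then run the real job over the files that
passed. The directory below is illustrative, and this launches one small job per
file, so it trades some efficiency for isolation:

import org.apache.hadoop.fs.{FileSystem, Path}
import scala.util.Try

val fs = FileSystem.get(sc.hadoopConfiguration)
val paths = fs.listStatus(new Path("hdfs:///logs/2015-11-13"))
  .map(_.getPath.toString)
  .filter(_.endsWith(".gz"))

// One probe job per file; a failed job marks the file as corrupt without
// taking down anything else.
val (good, bad) = paths.partition(p => Try(sc.textFile(p).count()).isSuccess)
println(s"skipping ${bad.length} corrupt files")

// textFile accepts a comma-separated list of paths.
val data = sc.textFile(good.mkString(","))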


Just to mention, doing a manual gzip -t on top of hadoop cat is infeasible, and
map-reduce is way faster at scanning the 15K files (70GB in total; it does
25M/s per node), while the old-style hadoop cat does much less.


Thanks,

Nicu


Please add us to the Powered by Spark page

2015-11-13 Thread Sujit Pal
Hello,

We have been using Spark at Elsevier Labs for a while now. Would love to be
added to the “Powered By Spark” page.

Organization Name: Elsevier Labs
URL: http://labs.elsevier.com
Spark components: Spark Core, Spark SQL, MLLib, GraphX.
Use Case: Building Machine Reading Pipeline, Knowledge Graphs, Content as a
Service, Content and Event Analytics, Content/Event based Predictive Models
and Big Data Processing. We use Scala and Python over Databricks Notebooks
for most of our work.

Thanks very much,

Sujit Pal
Technical Research Director
Elsevier Labs
sujit@elsevier.com


Re: spark 1.4 GC issue

2015-11-13 Thread Gaurav Kumar
Please have a look at http://spark.apache.org/docs/1.4.0/tuning.html
You may also want to use the latest build of JDK 7/8 and use G1GC instead.
I saw considerable reductions in GC time just by doing that.
The rest of the tuning parameters are better explained in the link above.
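
For reference, a hedged sketch of switching the executors to G1GC from the
application side; the GC flags and app name are illustrative, and the same
option can equally go into spark-defaults.conf or onto spark-submit:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("gc-tuning-example")
  // executor JVM flags: enable G1 and print GC details so the effect is visible
  .set("spark.executor.extraJavaOptions",
       "-XX:+UseG1GC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps")
val sc = new SparkContext(conf)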



Best Regards,
Gaurav Kumar
Big Data • Data Science • Photography • Music
+91 9953294125

On Fri, Nov 13, 2015 at 8:01 PM, Renu Yadav  wrote:

> I am using Spark 1.4 and my application is spending much time in GC, around
> 60-70% of the time for each task.
>
> I am using parallel GC.
> please help somebody as soon as possible.
>
> Thanks,
> Renu
>


Re: Kafka Offsets after application is restarted using Spark Streaming Checkpointing

2015-11-13 Thread Cody Koeninger
Unless you change maxRatePerPartition, a batch is going to contain all of
the offsets from the last known processed to the highest available.

Offsets are not time-based, and Kafka's time-based api currently has very
poor granularity (it's based on filesystem timestamp of the log segment).
There's a kafka improvement proposal to add time-based indexing, but I
wouldn't expect it soon.

Basically, if you want batches to relate to time even while your spark job
is down, you need an external process to index Kafka and do some custom
work to use that index to generate batches.

Or (preferably) embed a time in your message, and do any time-based
calculations using that time, not time of processing.
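
For reference, a hedged sketch of the direct stream with the rate cap mentioned
above, so that a catch-up batch after downtime is bounded; the broker, topic,
batch interval and rate are illustrative values:

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val conf = new SparkConf()
  .setAppName("kafka-direct-example")
  // cap how many messages per partition per second a single batch may pull
  .set("spark.streaming.kafka.maxRatePerPartition", "10000")
val ssc = new StreamingContext(conf, Seconds(900))   // 15-minute batches

val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, Map("metadata.broker.list" -> "broker1:9092"), Set("events"))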

On Fri, Nov 13, 2015 at 4:36 AM, kundan kumar  wrote:

> Hi,
>
> I am using spark streaming check-pointing mechanism and reading the data
> from kafka. The window duration for my application is 2 hrs with a sliding
> interval of 15 minutes.
>
> So, my batches run at following intervals...
> 09:45
> 10:00
> 10:15
> 10:30  and so on
>
> Suppose, my running batch dies at 09:55 and I restart the application at
> 12:05, then the flow is something like
>
> At 12:05 it would run the 10:00 batch -> would this read the kafka offsets
> from the time it went down (or 9:45)  to 12:00 ? or  just upto 10:10 ?
> then next would 10:15 batch - what would be the offsets as input for this
> batch ? ...so on for all the queued batches
>
>
> Basically, my requirement is such that when the application is restarted
> at 12:05 then it should read the kafka offsets till 10:00  and then the
> next queued batch takes offsets from 10:00 to 10:15 and so on until all the
> queued batches are processed.
>
> If this is the way offsets are handled for all the queued batched and I am
> fine.
>
> Or else please provide suggestions on how this can be done.
>
>
>
> Thanks!!!
>
>


Re: large, dense matrix multiplication

2015-11-13 Thread Eilidh Troup
Hi Sab,

Thanks for your response. We’re thinking of trying a bigger cluster, because we 
just started with 2 nodes. What we really want to know is whether the code will 
scale up with larger matrices and more nodes. I’d be interested to hear how 
large a matrix multiplication you managed to do?

Is there an alternative you’d recommend for calculating similarity over a large 
dataset?

Thanks,
Eilidh

On 13 Nov 2015, at 09:55, Sabarish Sasidharan  
wrote:

> We have done this by blocking but without using BlockMatrix. We used our own 
> blocking mechanism because BlockMatrix didn't exist in Spark 1.2. What is the 
> size of your block? How much memory are you giving to the executors? I assume 
> you are running on YARN, if so you would want to make sure your yarn executor 
> memory overhead is set to a higher value than default.
> 
> Just curious, could you also explain why you need matrix multiplication with 
> transpose? Smells like similarity computation.
> 
> Regards
> Sab
> 
> On Thu, Nov 12, 2015 at 7:27 PM, Eilidh Troup  wrote:
> Hi,
> 
> I’m trying to multiply a large squarish matrix with its transpose. Eventually 
> I’d like to work with matrices of size 200,000 by 500,000, but I’ve started 
> off first with 100 by 100 which was fine, and then with 10,000 by 10,000 
> which failed with an out of memory exception.
> 
> I used MLlib and BlockMatrix and tried various block sizes, and also tried 
> switching disk serialisation on.
> 
> We are running on a small cluster, using a CSV file in HDFS as the input data.
> 
> Would anyone with experience of multiplying large, dense matrices in spark be 
> able to comment on what to try to make this work?
> 
> Thanks,
> Eilidh
> 
> 
> --
> The University of Edinburgh is a charitable body, registered in
> Scotland, with registration number SC005336.
> 
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
> 
> 
> 
> 
> -- 
> 
> Architect - Big Data
> Ph: +91 99805 99458
> 
> Manthan Systems | Company of the year - Analytics (2014 Frost and Sullivan 
> India ICT)
> +++

The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

hang correlated to number of shards Re: Checkpointing with Kinesis hangs with socket timeouts when driver is relaunched while transforming on a 0 event batch

2015-11-13 Thread Hster Geguri
Just an update that the Kinesis checkpointing works well with orderly and
kill -9 driver shutdowns when there are fewer than 4 shards. We use 20+.

I created a case with Amazon support since it is the AWS kinesis getRecords
API which is hanging.

Regards,
Heji

On Thu, Nov 12, 2015 at 10:37 AM, Hster Geguri  wrote:

> Hello everyone,
>
> We are testing checkpointing against YARN 2.7.1 with Spark 1.5. We are
> trying to make sure checkpointing works with orderly shutdowns(i.e. yarn
> application --kill) and unexpected shutdowns which we simulate with a kill
> -9.  If there is anyone who has successfully tested failover recently with
> Kinesis+YARN, I would appreciate even the confirmation this should work.
>
> We have a simple driver that does aggregate counts per minute against a
> Kinesis stream.  We see initial hanging behavior (~2- 10 minutes) in both
> relaunches.  When we do an "unexpected" shutdown of the application master
> with "kill -9" of the jvm process, yarn successfully kills the orphan
> executors, launches a new driver with new executors. The logs indicate the
> restoring the checkpoint was successful.  However, the first two Spark
> streaming batches which are of 0 events intermittently hang for anywhere
> between 2-10+ minutes with a SocketTimeoutException while doing a Kinesis
> getRecords (stack trace at the end of this mail).
>
> Under normal circumstances, Spark skips the transform and mapToPair stages
> on 0 events. However, when the executors hang, we notice that the job goes
> through the transform stage and tasks hang for 2 minutes while making a
> getRecords call to Kinesis, before emitting a "SocketTimeoutException: Read
> timed out".
>
> Kinesis as a service should behave more gracefully even if it was fed bad
> parameters, but why does Spark call getRecords when the batch size is 0 when
> relaunching?
>
> Any input is greatly appreciated as we are stuck on testing failover.
>
> Heji
>
> I've put the stack trace below:
>
> [2015-11-09 15:20:23,478] INFO Unable to execute HTTP request: Read timed
> out (com.amazonaws.http.AmazonHttpClient)
> java.net.SocketTimeoutException: Read timed out
> at java.net.SocketInputStream.socketRead0(Native Method)
> at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
> at java.net.SocketInputStream.read(SocketInputStream.java:170)
> at java.net.SocketInputStream.read(SocketInputStream.java:141)
> at sun.security.ssl.InputRecord.readFully(InputRecord.java:465)
> at sun.security.ssl.InputRecord.readV3Record(InputRecord.java:593)
> at sun.security.ssl.InputRecord.read(InputRecord.java:532)
> at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:961)
> at
> sun.security.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.java:918)
> at sun.security.ssl.AppInputStream.read(AppInputStream.java:105)
> at
> org.apache.http.impl.io.AbstractSessionInputBuffer.read(AbstractSessionInputBuffer.java:198)
> at
> org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:178)
> at
> org.apache.http.conn.EofSensorInputStream.read(EofSensorInputStream.java:137)
> at
> com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:72)
> at
> com.amazonaws.event.ProgressInputStream.read(ProgressInputStream.java:151)
> at
> com.fasterxml.jackson.core.json.UTF8StreamJsonParser.loadMore(UTF8StreamJsonParser.java:176)
> at
> com.fasterxml.jackson.core.base.ParserBase.loadMoreGuaranteed(ParserBase.java:408)
> at
> com.fasterxml.jackson.core.json.UTF8StreamJsonParser._finishString2(UTF8StreamJsonParser.java:2184)
> at
> com.fasterxml.jackson.core.json.UTF8StreamJsonParser._finishString(UTF8StreamJsonParser.java:2165)
> at
> com.fasterxml.jackson.core.json.UTF8StreamJsonParser.getText(UTF8StreamJsonParser.java:279)
> at
> com.amazonaws.transform.JsonUnmarshallerContextImpl.readCurrentJsonTokenValue(JsonUnmarshallerContextImpl.java:129)
> at
> com.amazonaws.transform.JsonUnmarshallerContextImpl.readText(JsonUnmarshallerContextImpl.java:123)
> at
> com.amazonaws.transform.SimpleTypeJsonUnmarshallers$ByteBufferJsonUnmarshaller.unmarshall(SimpleTypeJsonUnmarshallers.java:185)
> at
> com.amazonaws.services.kinesis.model.transform.RecordJsonUnmarshaller.unmarshall(RecordJsonUnmarshaller.java:58)
> at
> com.amazonaws.services.kinesis.model.transform.RecordJsonUnmarshaller.unmarshall(RecordJsonUnmarshaller.java:31)
> at
> com.amazonaws.transform.ListUnmarshaller.unmarshallJsonToList(ListUnmarshaller.java:93)
> at
> com.amazonaws.transform.ListUnmarshaller.unmarshall(ListUnmarshaller.java:43)
> at
> com.amazonaws.services.kinesis.model.transform.GetRecordsResultJsonUnmarshaller.unmarshall(GetRecordsResultJsonUnmarshaller.java:50)
> at
> 

SequenceFile and object reuse

2015-11-13 Thread jeff saremi
So we tried reading a SequenceFile in Spark and realized that all our records
have ended up becoming the same.
Then one of us found this:

Note: Because Hadoop's RecordReader class re-uses the same Writable object for 
each record, directly caching the returned RDD or directly passing it to an 
aggregation or shuffle operation will create many references to the same 
object. If you plan to directly cache, sort, or aggregate Hadoop writable 
objects, you should first copy them using a map function.

Is there anyone who can shed some light on this bizarre behavior and the
decisions behind it?
I would also like to know whether anyone has been able to read a binary file
without incurring the additional map() suggested above. What format did you use?
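
For reference, a hedged sketch of the copy-with-map pattern the note above
describes; the path and the key/value classes are assumptions for illustration:

import org.apache.hadoop.io.{LongWritable, Text}

val raw = sc.sequenceFile("hdfs:///data/events.seq", classOf[LongWritable], classOf[Text])

// Hadoop reuses the same Writable instances for every record, so materialize
// plain JVM objects before caching, sorting or shuffling.
val copied = raw.map { case (k, v) => (k.get(), v.toString) }
copied.cache()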

Thanks,
Jeff

Re: very slow parquet file write

2015-11-13 Thread Davies Liu
Have you used any partitioned columns when writing as JSON or Parquet?

On Fri, Nov 6, 2015 at 6:53 AM, Rok Roskar  wrote:
> yes I was expecting that too because of all the metadata generation and
> compression. But I have not seen performance this bad for other parquet
> files I’ve written and was wondering if there could be something obvious
> (and wrong) to do with how I’ve specified the schema etc. It’s a very simple
> schema consisting of a StructType with a few StructField floats and a
> string. I’m using all the spark defaults for io compression.
>
> I'll see what I can do about running a profiler -- can you point me to a
> resource/example?
>
> Thanks,
>
> Rok
>
> ps: my post on the mailing list is still listed as not accepted by the
> mailing list:
> http://apache-spark-user-list.1001560.n3.nabble.com/very-slow-parquet-file-write-td25295.html
> -- none of your responses are there either. I am definitely subscribed to
> the list though (I get daily digests). Any clue how to fix it?
>
>
>
>
> On Nov 6, 2015, at 9:26 AM, Cheng Lian  wrote:
>
> I'd expect writing Parquet files slower than writing JSON files since
> Parquet involves more complicated encoders, but maybe not that slow. Would
> you mind to try to profile one Spark executor using tools like YJP to see
> what's the hotspot?
>
> Cheng
>
> On 11/6/15 7:34 AM, rok wrote:
>
> Apologies if this appears a second time!
>
> I'm writing a ~100 Gb pyspark DataFrame with a few hundred partitions into a
> parquet file on HDFS. I've got a few hundred nodes in the cluster, so for
> the size of file this is way over-provisioned (I've tried it with fewer
> partitions and fewer nodes, no obvious effect). I was expecting the dump to
> disk to be very fast -- the DataFrame is cached in memory and contains just
> 14 columns (13 are floats and one is a string). When I write it out in json
> format, this is indeed reasonably fast (though it still takes a few minutes,
> which is longer than I would expect).
>
> However, when I try to write a parquet file it takes way longer -- the first
> set of tasks finishes in a few minutes, but the subsequent tasks take more
> than twice as long or longer. In the end it takes over half an hour to write
> the file. I've looked at the disk I/O and cpu usage on the compute nodes and
> it looks like the processors are fully loaded while the disk I/O is
> essentially zero for long periods of time. I don't see any obvious garbage
> collection issues and there are no problems with memory.
>
> Any ideas on how to debug/fix this?
>
> Thanks!
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/very-slow-parquet-file-write-tp25295.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>
>
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Columnar Statisics

2015-11-13 Thread sara mustafa
Hi,
I am using Spark 1.5.2 and I noticed the existence of the class
org.apache.spark.sql.columnar.ColumnStatisticsSchema. How can I use it to
calculate column statistics of a DataFrame?
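
That class is internal, as far as I know; a hedged sketch of getting basic
column statistics through the public DataFrame API instead (the path and column
names are assumptions, and sqlContext is an existing SQLContext):

val df = sqlContext.read.parquet("hdfs:///data/table")
// count, mean, stddev, min and max for the named numeric columns
df.describe("price", "quantity").show()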

Thanks,



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Columnar-Statisics-tp25377.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: What is difference btw reduce & fold?

2015-11-13 Thread firemonk9
This is very well explained.

Thank you



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/What-is-difference-btw-reduce-fold-tp22653p25376.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



SparkException: Could not read until the end sequence number of the range

2015-11-13 Thread Alan Dipert
Hi all,
We're running Spark 1.5.0 on EMR 4.1.0 in AWS and consuming from Kinesis.

We saw the following exception today - it killed the Spark "step":

org.apache.spark.SparkException: Could not read until the end sequence
number of the range

We guessed it was because our Kinesis stream didn't have enough shards and
we were being throttled.  We bumped the number of shards and haven't seen
the problem again over the past several hours, but I am curious: does this
sound like the actual reason?  Was bumping shard # appropriate?

I'd be curious to hear if anyone else experienced this issue and knows
exactly what the problem is.

Thanks,
Alan


pyspark sql: number of partitions and partition by size?

2015-11-13 Thread Wei Chen
Hey Friends,

I am trying to use sqlContext.write.parquet() to write dataframe to parquet
files. I have the following questions.

1. number of partitions
The default number of partitions seems to be 200. Is there any way other
than using df.repartition(n) to change this number? I was told repartition
can be very expensive.

2. partition by size
When I use df.partitionBy(['year']), if the number of entries with
"year=2006" is very small, the sizes of the files under partition
"year=2006" can be very small. If we can assign a size to each partition
file, that'll be very helpful.
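
For reference, hedged sketches (shown in Scala; the same API exists in PySpark)
of the two knobs usually suggested here: the shuffle-partition setting, which is
typically where the default of 200 comes from, and coalesce(), which avoids the
full shuffle of repartition(). Paths and values are illustrative:

// Affects the partition count of joins/aggregations that produce the DataFrame.
sqlContext.setConf("spark.sql.shuffle.partitions", "64")

val df = sqlContext.read.parquet("hdfs:///input")
df.coalesce(16)                      // narrow dependency, no full shuffle
  .write.partitionBy("year")
  .parquet("hdfs:///output")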


Thank you,
Wei


does spark ML have some thing like createDataPartition() in R caret package ?

2015-11-13 Thread Andy Davidson
In R, it's easy to split a data set into training, cross-validation, and test
sets. Is there something like this in spark.ml? I am using Python as of now.

My real problem is that I want to randomly select a relatively small data set to
do some initial data exploration. It's not clear to me how, using Spark, I
could create a random sample from a large data set. I would prefer to sample
without replacement.

I have not tried to use SparkR yet. I assume I would not be able to use the
caret package with Spark ML.

Kind regards

Andy

```{R}
   inTrain <- createDataPartition(y=csv$classe, p=0.7, list=FALSE)
trainSetDF <- csv[inTrain,]
testSetDF <- csv[-inTrain,]
```





Join and HashPartitioner question

2015-11-13 Thread Alexander Pivovarov
Hi Everyone

Is there any difference in performance between the following two joins?


val r1: RDD[(String, String)] = ???
val r2: RDD[(String, String)] = ???

val partNum = 80
val partitioner = new HashPartitioner(partNum)

// Join 1
val res1 = r1.partitionBy(partitioner).join(r2.partitionBy(partitioner))

// Join 2
val res2 = r1.join(r2, partNum)


Re: Distributing Python code packaged as tar balls

2015-11-13 Thread Davies Liu
Python does not support libraries distributed as tar balls, so PySpark may not
support that either.

On Wed, Nov 4, 2015 at 5:40 AM, Praveen Chundi  wrote:
> Hi,
>
> Pyspark/spark-submit offers a --py-files handle to distribute Python code
> for execution. Currently (version 1.5) only zip files seem to be supported; I
> have tried distributing tar balls unsuccessfully.
>
> Is it worth adding support for tar balls?
>
> Best regards,
> Praveen Chundi
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: bin/pyspark SparkContext is missing?

2015-11-13 Thread Davies Liu
You forgot to create a SparkContext instance:

sc = SparkContext()

On Tue, Nov 3, 2015 at 9:59 AM, Andy Davidson
 wrote:
> I am having a heck of a time getting Ipython notebooks to work on my 1.5.1
> AWS cluster I created using spark-1.5.1-bin-hadoop2.6/ec2/spark-ec2
>
> I have read the instructions for using iPython notebook on
> http://spark.apache.org/docs/latest/programming-guide.html#using-the-shell
>
> I want to run the notebook server on my master and use an ssh tunnel to
> connect a web browser running on my mac.
>
> I am confident the cluster is set up correctly because the sparkPi example
> runs.
>
> I am able to use IPython notebooks on my local mac and work with spark and
> local files with out any problems.
>
> I know the ssh tunnel is working.
>
> On my cluster I am able to use python shell in general
>
> [ec2-user@ip-172-31-29-60 dataScience]$ /root/spark/bin/pyspark --master
> local[2]
>
> >>> from pyspark import SparkContext
> >>> textFile = sc.textFile("file:///home/ec2-user/dataScience/readme.txt")
> >>> textFile.take(1)
>
>
>
> When I run the exact same code in iPython notebook I get
>
> ---
> NameError Traceback (most recent call last)
>  in ()
>  11 from pyspark import SparkContext, SparkConf
>  12
> ---> 13 textFile =
> sc.textFile("file:///home/ec2-user/dataScience/readme.txt")
>  14
>  15 textFile.take(1)
>
> NameError: name 'sc' is not defined
>
>
>
>
> To try and debug, I wrote a script to launch pyspark and added 'set -x' to
> pyspark so I could see what the script was doing
>
> Any idea how I can debug this?
>
> Thanks in advance
>
> Andy
>
> $ cat notebook.sh
>
> set -x
>
> export PYSPARK_DRIVER_PYTHON=ipython
>
> export PYSPARK_DRIVER_PYTHON_OPTS="notebook --no-browser --port=7000"
>
> /root/spark/bin/pyspark --master local[2]
>
>
>
>
> [ec2-user@ip-172-31-29-60 dataScience]$ ./notebook.sh
>
> ++ export PYSPARK_DRIVER_PYTHON=ipython
>
> ++ PYSPARK_DRIVER_PYTHON=ipython
>
> ++ export 'PYSPARK_DRIVER_PYTHON_OPTS=notebook --no-browser --port=7000'
>
> ++ PYSPARK_DRIVER_PYTHON_OPTS='notebook --no-browser --port=7000'
>
> ++ /root/spark/bin/pyspark --master 'local[2]'
>
> +++ dirname /root/spark/bin/pyspark
>
> ++ cd /root/spark/bin/..
>
> ++ pwd
>
> + export SPARK_HOME=/root/spark
>
> + SPARK_HOME=/root/spark
>
> + source /root/spark/bin/load-spark-env.sh
>
>  dirname /root/spark/bin/pyspark
>
> +++ cd /root/spark/bin/..
>
> +++ pwd
>
> ++ FWDIR=/root/spark
>
> ++ '[' -z '' ']'
>
> ++ export SPARK_ENV_LOADED=1
>
> ++ SPARK_ENV_LOADED=1
>
>  dirname /root/spark/bin/pyspark
>
> +++ cd /root/spark/bin/..
>
> +++ pwd
>
> ++ parent_dir=/root/spark
>
> ++ user_conf_dir=/root/spark/conf
>
> ++ '[' -f /root/spark/conf/spark-env.sh ']'
>
> ++ set -a
>
> ++ . /root/spark/conf/spark-env.sh
>
> +++ export JAVA_HOME=/usr/java/latest
>
> +++ JAVA_HOME=/usr/java/latest
>
> +++ export SPARK_LOCAL_DIRS=/mnt/spark,/mnt2/spark
>
> +++ SPARK_LOCAL_DIRS=/mnt/spark,/mnt2/spark
>
> +++ export SPARK_MASTER_OPTS=
>
> +++ SPARK_MASTER_OPTS=
>
> +++ '[' -n 1 ']'
>
> +++ export SPARK_WORKER_INSTANCES=1
>
> +++ SPARK_WORKER_INSTANCES=1
>
> +++ export SPARK_WORKER_CORES=2
>
> +++ SPARK_WORKER_CORES=2
>
> +++ export HADOOP_HOME=/root/ephemeral-hdfs
>
> +++ HADOOP_HOME=/root/ephemeral-hdfs
>
> +++ export
> SPARK_MASTER_IP=ec2-54-215-207-132.us-west-1.compute.amazonaws.com
>
> +++ SPARK_MASTER_IP=ec2-54-215-207-132.us-west-1.compute.amazonaws.com
>
>  cat /root/spark-ec2/cluster-url
>
> +++ export
> MASTER=spark://ec2-54-215-207-132.us-west-1.compute.amazonaws.com:7077
>
> +++ MASTER=spark://ec2-54-215-207-132.us-west-1.compute.amazonaws.com:7077
>
> +++ export SPARK_SUBMIT_LIBRARY_PATH=:/root/ephemeral-hdfs/lib/native/
>
> +++ SPARK_SUBMIT_LIBRARY_PATH=:/root/ephemeral-hdfs/lib/native/
>
> +++ export SPARK_SUBMIT_CLASSPATH=::/root/ephemeral-hdfs/conf
>
> +++ SPARK_SUBMIT_CLASSPATH=::/root/ephemeral-hdfs/conf
>
>  wget -q -O - http://169.254.169.254/latest/meta-data/public-hostname
>
> +++ export
> SPARK_PUBLIC_DNS=ec2-54-215-207-132.us-west-1.compute.amazonaws.com
>
> +++ SPARK_PUBLIC_DNS=ec2-54-215-207-132.us-west-1.compute.amazonaws.com
>
> +++ export YARN_CONF_DIR=/root/ephemeral-hdfs/conf
>
> +++ YARN_CONF_DIR=/root/ephemeral-hdfs/conf
>
>  id -u
>
> +++ '[' 222 == 0 ']'
>
> ++ set +a
>
> ++ '[' -z '' ']'
>
> ++ ASSEMBLY_DIR2=/root/spark/assembly/target/scala-2.11
>
> ++ ASSEMBLY_DIR1=/root/spark/assembly/target/scala-2.10
>
> ++ [[ -d /root/spark/assembly/target/scala-2.11 ]]
>
> ++ '[' -d /root/spark/assembly/target/scala-2.11 ']'
>
> ++ export SPARK_SCALA_VERSION=2.10
>
> ++ SPARK_SCALA_VERSION=2.10
>
> + export '_SPARK_CMD_USAGE=Usage: ./bin/pyspark [options]'
>
> + _SPARK_CMD_USAGE='Usage: ./bin/pyspark [options]'
>
> + hash python2.7
>
> + DEFAULT_PYTHON=python2.7
>
> + [[ -n '' ]]
>
> + [[ '' == \1 ]]
>
> + [[ -z ipython ]]
>
> + 

Re: a way to allow spark job to continue despite task failures?

2015-11-13 Thread Ted Yu
I searched the code base and looked at:
https://spark.apache.org/docs/latest/running-on-yarn.html

I didn't find mapred.max.map.failures.percent or its counterpart.

FYI

On Fri, Nov 13, 2015 at 9:05 AM, Nicolae Marasoiu <
nicolae.maras...@adswizz.com> wrote:

> Hi,
>
>
> I know a task can fail 2 times and only the 3rd breaks the entire job.
>
> I am good with this number of attempts.
>
> I would like that after trying a task 3 times, it continues with the other
> tasks.
>
> The job can be "failed", but I want all tasks run.
>
> Please see my use case.
>
>
> I read a hadoop input set, and some gzip files are incomplete. I would
> like to just skip them and the only way I see is to tell Spark to ignore
> some tasks permanently failing, if it is possible. With traditional hadoop
> map-reduce this was possible using mapred.max.map.failures.percent.
>
>
> Do map-reduce params like mapred.max.map.failures.percent apply to
> Spark/YARN map-reduce jobs ?
>
> I edited $HADOOP_CONF_DIR/mapred-site.xml and
> added mapred.max.map.failures.percent=30 but does not seem to apply, job
> still failed after 3 task attempt fails.
>
>
> Should Spark transmit this parameter? Or the mapred.* do not apply?
>
> Do other hadoop parameters (e.g. the ones involved in the input reading,
> not in the "processing" or "application" like this max.map.failures) - are
> others taken into account and transmitted? I saw that it should scan
> HADOOP_CONF_DIR and forward those, but I guess this does not apply to any
> parameter, since Spark has its own distribution & DAG stages processing
> logic, which just happens to have a YARN implementation.
>
>
> Do you know a way to do this in Spark - to ignore a predefined number of
> tasks fail, but allow the job to continue? This way I could see all the
> faulty input files in one job run, delete them all and continue with the
> rest.
>
>
> Just to mention, doing a manual gzip -t on top of hadoop cat is infeasible
> and map-reduce is way faster to scan the 15K files worth 70GB (its doing
> 25M/s per node), while the old style hadoop cat is doing much less.
>
>
> Thanks,
>
> Nicu
>


Re: very slow parquet file write

2015-11-13 Thread Rok Roskar
I'm not sure what you mean; I didn't do anything specifically to partition
the columns.
On Nov 14, 2015 00:38, "Davies Liu"  wrote:

> Do you have partitioned columns?
>
> On Thu, Nov 5, 2015 at 2:08 AM, Rok Roskar  wrote:
> > I'm writing a ~100 Gb pyspark DataFrame with a few hundred partitions
> into a
> > parquet file on HDFS. I've got a few hundred nodes in the cluster, so for
> > the size of file this is way over-provisioned (I've tried it with fewer
> > partitions and fewer nodes, no obvious effect). I was expecting the dump
> to
> > disk to be very fast -- the DataFrame is cached in memory and contains
> just
> > 14 columns (13 are floats and one is a string). When I write it out in
> json
> > format, this is indeed reasonably fast (though it still takes a few
> minutes,
> > which is longer than I would expect).
> >
> > However, when I try to write a parquet file it takes way longer -- the
> first
> > set of tasks finishes in a few minutes, but the subsequent tasks take
> more
> > than twice as long or longer. In the end it takes over half an hour to
> write
> > the file. I've looked at the disk I/O and cpu usage on the compute nodes
> and
> > it looks like the processors are fully loaded while the disk I/O is
> > essentially zero for long periods of time. I don't see any obvious
> garbage
> > collection issues and there are no problems with memory.
> >
> > Any ideas on how to debug/fix this?
> >
> > Thanks!
> >
> >
>


send transformed RDD to s3 from slaves

2015-11-13 Thread Walrus theCat
Hi,

I have an RDD which crashes the driver when being collected.  I want to
send the data on its partitions out to S3 without bringing it back to the
driver. I try calling rdd.foreachPartition, but the data that gets sent has
not gone through the chain of transformations that I need.  It's the data
as it was ingested initially.  After specifying my chain of
transformations, but before calling foreachPartition, I call rdd.count in
order to force the RDD to transform.  The data it sends out is still not
transformed.  How do I get the RDD to send out transformed data when
calling foreachPartition?
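
For reference, a hedged sketch of the usual shape: call foreachPartition on the
value returned by the last transformation, not on the original ingested RDD
(RDDs are immutable, so transformations never modify the variable they were
called on). The paths, transformations and upload stub below are illustrative;
rdd.saveAsTextFile("s3n://bucket/prefix") is often the simpler route.

val raw = sc.textFile("hdfs:///ingest/events")
val transformed = raw.map(_.toUpperCase).filter(_.nonEmpty)   // stand-in transformations

transformed.foreachPartition { iter =>
  // open one S3 client per partition here (omitted), then push each record
  iter.foreach(record => println(record))   // stand-in for the actual upload
}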

Thanks


Re: very slow parquet file write

2015-11-13 Thread Davies Liu
Do you have partitioned columns?

On Thu, Nov 5, 2015 at 2:08 AM, Rok Roskar  wrote:
> I'm writing a ~100 Gb pyspark DataFrame with a few hundred partitions into a
> parquet file on HDFS. I've got a few hundred nodes in the cluster, so for
> the size of file this is way over-provisioned (I've tried it with fewer
> partitions and fewer nodes, no obvious effect). I was expecting the dump to
> disk to be very fast -- the DataFrame is cached in memory and contains just
> 14 columns (13 are floats and one is a string). When I write it out in json
> format, this is indeed reasonably fast (though it still takes a few minutes,
> which is longer than I would expect).
>
> However, when I try to write a parquet file it takes way longer -- the first
> set of tasks finishes in a few minutes, but the subsequent tasks take more
> than twice as long or longer. In the end it takes over half an hour to write
> the file. I've looked at the disk I/O and cpu usage on the compute nodes and
> it looks like the processors are fully loaded while the disk I/O is
> essentially zero for long periods of time. I don't see any obvious garbage
> collection issues and there are no problems with memory.
>
> Any ideas on how to debug/fix this?
>
> Thanks!
>
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



out of memory error with Parquet

2015-11-13 Thread AlexG
I'm using Spark to read in a data from many files and write it back out in
Parquet format for ease of use later on. Currently, I'm using this code:

 val fnamesRDD = sc.parallelize(fnames,
ceil(fnames.length.toFloat/numfilesperpartition).toInt)
 val results = fnamesRDD.mapPartitionsWithIndex((index, fnames) =>
extractData(fnames, variablenames, index))
 results.toDF.saveAsParquetFile(outputdir)

extractData returns an array of tuples of (filename, array of floats)
corresponding to all the files in the partition. Each partition results in
about .6Gb data, corresponding to just 3 files per partition.

The problem is, I have 100s of files to convert, and apparently
saveAsParquetFile tries to pull down the data from too many of the
conversion tasks onto the driver at a time, so causes an OOM. E.g., I got an
error about it trying to pull down >4GB of data corresponding to 9 tasks
onto the driver. I could increase the driver memory, but this wouldn't help
if saveAsParquet then decided to pull in 100 tasks at a time.

Is there a way to avoid this OOM error?




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/out-of-memory-error-with-Parquet-tp25381.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Spak filestreaming issue

2015-11-13 Thread ravi.gawai
Hi,
I am trying simple file streaming example using
Sparkstreaming(spark-streaming_2.10,version:1.5.1)

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.streaming.Duration;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class DStreamExample {

    public static void main(final String[] args) {

        final SparkConf sparkConf = new SparkConf();
        sparkConf.setAppName("SparkJob");
        sparkConf.setMaster("local[4]"); // for local

        final JavaSparkContext sc = new JavaSparkContext(sparkConf);
        final JavaStreamingContext ssc = new JavaStreamingContext(sc, new Duration(2000));

        // textFileStream only picks up files created in or moved into the
        // directory after the stream has started
        final JavaDStream<String> lines = ssc.textFileStream("/opt/test/");
        lines.print();

        ssc.start();
        ssc.awaitTermination();
    }
}

When I run this code on a single file or directory it does not print anything
from the file. I see in the logs that it is constantly polling, but nothing is
printed. I tried moving a file into the directory while the program was running.

Is there something I am missing? I also tried applying a map function on the
lines stream; that does not work either.




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spak-filestreaming-issue-tp25380.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Save GraphX to disk

2015-11-13 Thread SLiZn Liu
Hi Gaurav,

Your graph can be saved to graph databases like Neo4j or Titan through
their drivers, which eventually persist it to disk.

BR,
Todd

Gaurav Kumar gauravkuma...@gmail.com> wrote on Fri, 13 Nov 2015 at 22:08:

> Hi,
>
> I was wondering how to save a graph to disk and load it back again. I know
> how to save vertices and edges to disk and construct the graph from them,
> not sure if there's any method to save the graph itself to disk.
>
> Best Regards,
> Gaurav Kumar
> Big Data • Data Science • Photography • Music
> +91 9953294125
>


Re: does spark ML have some thing like createDataPartition() in R caret package ?

2015-11-13 Thread Sonal Goyal
The RDD has a takeSample method where you can supply the flag for
replacement or not as well as the fraction to sample.
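
For reference, a hedged sketch of the sampling options (shown in Scala; the same
methods exist on PySpark RDDs). Note that takeSample takes an exact count and
collects to the driver, sample takes a fraction, and randomSplit is the closest
analogue to createDataPartition. The path and numbers are illustrative:

val data = sc.textFile("hdfs:///data/training.csv")

// 70/30 train/test split, without replacement
val Array(train, test) = data.randomSplit(Array(0.7, 0.3), seed = 42L)

// a small distributed sample for exploration
val preview = data.sample(withReplacement = false, fraction = 0.01, seed = 42L)

// an exact-size sample, returned to the driver as a local array
val small = data.takeSample(withReplacement = false, num = 1000, seed = 42L)
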
On Nov 14, 2015 2:51 AM, "Andy Davidson" 
wrote:

> In R, its easy to split a data set into training, crossValidation, and
> test set. Is there something like this in spark.ml? I am using python of
> now.
>
> My real problem is I want to randomly select a relatively small data set
> to do some initial data exploration. Its not clear to me how using spark I
> could create a random sample from a large data set. I would prefer to
> sample with out replacement.
>
> I have not tried to use sparkR yet. I assume I would not be able to use
> the caret package with spark ML
>
> Kind regards
>
> Andy
>
> ```{R}
>inTrain <- createDataPartition(y=csv$classe, p=0.7, list=FALSE)
> trainSetDF <- csv[inTrain,]
> testSetDF <- csv[-inTrain,]
> ```
>
>


Re: large, dense matrix multiplication

2015-11-13 Thread Burak Yavuz
Hi,

The BlockMatrix multiplication should be much more efficient on the current
master (and will be available with Spark 1.6). Could you please give that a
try if you have the chance?

Thanks,
Burak

On Fri, Nov 13, 2015 at 10:11 AM, Sabarish Sasidharan <
sabarish.sasidha...@manthan.com> wrote:

> Hi Eilidh
>
> Because you are multiplying with the transpose you don't have  to
> necessarily build the right side of the matrix. I hope you see that. You
> can broadcast blocks of the indexed row matrix to itself and achieve the
> multiplication.
>
> But for similarity computation you might want to use some approach like
> locality sensitive hashing first to identify a bunch of similar customers
> and then apply cosine similarity on that narrowed down list. That would
> scale much better than matrix multiplication. You could try the following
> options for the same.
>
> https://github.com/soundcloud/cosine-lsh-join-spark
> http://spark-packages.org/package/tdebatty/spark-knn-graphs
> https://github.com/marufaytekin/lsh-spark
>
> Regards
> Sab
> Hi Sab,
>
> Thanks for your response. We’re thinking of trying a bigger cluster,
> because we just started with 2 nodes. What we really want to know is
> whether the code will scale up with larger matrices and more nodes. I’d be
> interested to hear how large a matrix multiplication you managed to do?
>
> Is there an alternative you’d recommend for calculating similarity over a
> large dataset?
>
> Thanks,
> Eilidh
>
> On 13 Nov 2015, at 09:55, Sabarish Sasidharan <
> sabarish.sasidha...@manthan.com> wrote:
>
> We have done this by blocking but without using BlockMatrix. We used our
> own blocking mechanism because BlockMatrix didn't exist in Spark 1.2. What
> is the size of your block? How much memory are you giving to the executors?
> I assume you are running on YARN, if so you would want to make sure your
> yarn executor memory overhead is set to a higher value than default.
>
> Just curious, could you also explain why you need matrix multiplication
> with transpose? Smells like similarity computation.
>
> Regards
> Sab
>
> On Thu, Nov 12, 2015 at 7:27 PM, Eilidh Troup 
> wrote:
>
>> Hi,
>>
>> I’m trying to multiply a large squarish matrix with its transpose.
>> Eventually I’d like to work with matrices of size 200,000 by 500,000, but
>> I’ve started off first with 100 by 100 which was fine, and then with 10,000
>> by 10,000 which failed with an out of memory exception.
>>
>> I used MLlib and BlockMatrix and tried various block sizes, and also
>> tried switching disk serialisation on.
>>
>> We are running on a small cluster, using a CSV file in HDFS as the input
>> data.
>>
>> Would anyone with experience of multiplying large, dense matrices in
>> spark be able to comment on what to try to make this work?
>>
>> Thanks,
>> Eilidh
>>
>>
>> --
>> The University of Edinburgh is a charitable body, registered in
>> Scotland, with registration number SC005336.
>>
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>
>
> --
>
> Architect - Big Data
> Ph: +91 99805 99458
>
> Manthan Systems | *Company of the year - Analytics (2014 Frost and
> Sullivan India ICT)*
> +++
>
>
>
> The University of Edinburgh is a charitable body, registered in
> Scotland, with registration number SC005336.
>
>


RE: Connecting SparkR through Yarn

2015-11-13 Thread Sun, Rui
I guess this is not related to SparkR. It seems that Spark can't pick up the
hostname/IP address of the ResourceManager.

Make sure you have correctly set the YARN_CONF_DIR env var and have configured
the address of the ResourceManager in yarn-site.xml.

From: Amit Behera [mailto:amit.bd...@gmail.com]
Sent: Friday, November 13, 2015 9:38 PM
To: Sun, Rui; user@spark.apache.org
Subject: Re: Connecting SparkR through Yarn

Hi Sun,
Thank you for reply.

I did the same, but now I am getting another issue.

i.e. I am not able to connect to the ResourceManager after submitting the job;
the error message shows something like this:

Connecting to ResourceManager at /0.0.0.0:8032
- INFO ipc.Client: Retrying connect to server: 
0.0.0.0/0.0.0.0:8032. Already tried 0 time(s);
-INFO ipc.Client: Retrying connect to server: 
0.0.0.0/0.0.0.0:8032. Already tried 1 time(s)


I tried resolving this, but was not able to.



On Wed, Nov 11, 2015 at 12:02 PM, Sun, Rui wrote:
Amit,
You can simply set “MASTER” as “yarn-client” before calling sparkR.init().
Sys.setenv("MASTER"="yarn-client")

I assume that you have set “YARN_CONF_DIR” env variable required for running 
Spark on YARN.

If you want to set more YARN specific configurations, you can for example
Sys.setenv ("SPARKR_SUBMIT_ARGS", " --master yarn-client --num-executors 4 
sparkr-shell"
Before calling sparkR.init().

From: Amit Behera [mailto:amit.bd...@gmail.com]
Sent: Monday, November 9, 2015 2:36 AM
To: user@spark.apache.org
Subject: Connecting SparkR through Yarn

Hi All,

Spark Version = 1.5.1
Hadoop Version = 2.6.0

I set up the cluster in Amazon EC2 machines (1+5)
I am able create a SparkContext object using init method from RStudio.

But I do not know how can I create a SparkContext object in yarn mode.

I found the link below for running on YARN, but that document covers Spark
versions >= 0.9.0 and <= 1.2.

https://github.com/amplab-extras/SparkR-pkg/blob/master/README.md#running-on-yarn


Please help me how can I connect SparkR on Yarn.



Thanks,
Amit.



Re: out of memory error with Parquet

2015-11-13 Thread AlexG
Never mind; when I switched to Spark 1.5.0, my code works as written and is
pretty fast! Looking at some Parquet related Spark jiras, it seems that
Parquet is known to have some memory issues with buffering and writing, and
that at least some were resolved in Spark 1.5.0. 






--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/out-of-memory-error-with-Parquet-tp25381p25382.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: out of memory error with Parquet

2015-11-13 Thread Josh Rosen
Tip: jump straight to 1.5.2; it has some key bug fixes.

Sent from my phone

> On Nov 13, 2015, at 10:02 PM, AlexG  wrote:
> 
> Never mind; when I switched to Spark 1.5.0, my code works as written and is
> pretty fast! Looking at some Parquet related Spark jiras, it seems that
> Parquet is known to have some memory issues with buffering and writing, and
> that at least some were resolved in Spark 1.5.0. 
> 
> 
> 
> 
> 
> 
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/out-of-memory-error-with-Parquet-tp25381p25382.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
> 

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Kafka Offsets after application is restarted using Spark Streaming Checkpointing

2015-11-13 Thread kundan kumar
Hi,

I am using spark streaming check-pointing mechanism and reading the data
from kafka. The window duration for my application is 2 hrs with a sliding
interval of 15 minutes.

So, my batches run at following intervals...
09:45
10:00
10:15
10:30  and so on

Suppose, my running batch dies at 09:55 and I restart the application at
12:05, then the flow is something like

At 12:05 it would run the 10:00 batch -> would this read the Kafka offsets
from the time it went down (or 9:45) to 12:00, or just up to 10:10?
Then the next would be the 10:15 batch - what would the offsets be as input for
this batch? ...and so on for all the queued batches.


Basically, my requirement is such that when the application is restarted at
12:05 then it should read the kafka offsets till 10:00  and then the next
queued batch takes offsets from 10:00 to 10:15 and so on until all the
queued batches are processed.

If this is the way offsets are handled for all the queued batches, then I am
fine.

Or else please provide suggestions on how this can be done.



Thanks!!!


Spark and Spring Integrations

2015-11-13 Thread Netai Biswas
Hi,

I am facing an issue while integrating Spark with Spring.

I am getting "java.lang.IllegalStateException: Cannot deserialize
BeanFactory with id" errors for all beans. I have tried a few solutions
available on the web. Please help me out with this issue.

Few details:

Java : 8
Spark : 1.5.1
Spring : 3.2.9.RELEASE

Please let me know if you need more information or any sample code.

Thanks,
Netai 



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-and-Spring-Integrations-tp25375.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Spark Executors off-heap memory usage keeps increasing

2015-11-13 Thread Balthasar Schopman
Hi,

The off-heap memory usage of the 3 Spark executor processes keeps increasing
constantly until the boundaries of the physical RAM are hit. This happened two
weeks ago, at which point the system came to a grinding halt because it was
unable to spawn new processes. At such a moment, restarting Spark is the obvious
solution. In the collectd memory usage graph in the link below (1) we see two
moments when we restarted Spark: last week when we upgraded Spark from 1.4.1
to 1.5.1, and two weeks ago when the physical memory was exhausted.
(1) http://i.stack.imgur.com/P4DE3.png

As can be seen at the bottom of this mail (2), the Spark executor process uses 
approx. 62GB of memory, while the heap size max is set to 20GB. This means the 
off-heap memory usage is approx. 42GB.

Some info:
 - We use Spark Streaming lib.
 - Our code is written in Java.
 - We run Oracle Java v1.7.0_76
 - Data is read from Kafka (Kafka runs on different boxes).
 - Data is written to Cassandra (Cassandra runs on different boxes).
 - 1 Spark master and 3 Spark executors/workers, running on 4 separate boxes.
 - We recently upgraded Spark to 1.4.1 and 1.5.1 and the memory usage pattern 
is identical on all those versions.

What can be the cause of this ever-increasing off-heap memory use?

PS: I've posted this question on StackOverflow yesterday: 
http://stackoverflow.com/questions/33668035/spark-executors-off-heap-memory-usage-keeps-increasing

(2)
$ ps aux | grep 40724
USER   PID %CPU %MEMVSZ   RSS TTY  STAT START   TIME COMMAND
apache-+ 40724  140 47.1 75678780 62181644 ?   Sl   Nov06 11782:27 
/usr/lib/jvm/java-7-oracle/jre/bin/java -cp 
/opt/spark-1.5.1-bin-hadoop2.4/conf/:/opt/spark-1.5.1-bin-hadoop2.4/lib/spark-assembly-1.5.1-hadoop2.4.0.jar:/opt/spark-1.5.1-bin-hadoop2.4/lib/datanucleus-rdbms-3.2.9.jar:/opt/spark-1.5.1-bin-hadoop2.4/lib/datanucleus-api-jdo-3.2.6.jar:/opt/spark-1.5.1-bin-hadoop2.4/lib/datanucleus-core-3.2.10.jar
 -Xms20480M -Xmx20480M -Dspark.driver.port=7201 -Dspark.blockManager.port=7206 
-Dspark.executor.port=7202 -Dspark.broadcast.port=7204 
-Dspark.fileserver.port=7203 -Dspark.replClassServer.port=7205 
-XX:MaxPermSize=256m org.apache.spark.executor.CoarseGrainedExecutorBackend 
--driver-url 
akka.tcp://sparkdri...@xxx.xxx.xxx.xxx:7201/user/CoarseGrainedScheduler 
--executor-id 2 --hostname xxx.xxx.xxx.xxx --cores 10 --app-id 
app-20151106125547- --worker-url 
akka.tcp://sparkwor...@xxx.xxx.xxx.xxx:7200/user/Worker
$ sudo -u apache-spark jps
40724 CoarseGrainedExecutorBackend
40517 Worker
30664 Jps
$ sudo -u apache-spark jstat -gc 40724
 S0CS1CS0US1U  EC   EUOC OU   PC PU 
   YGC YGCTFGCFGCT GCT
158720.0 157184.0 110339.8  0.0   6674944.0 1708036.1 13981184.0 2733206.2  
59904.0 59551.9  41944 1737.864  39 13.464 1751.328
$ sudo -u apache-spark jps -v
40724 CoarseGrainedExecutorBackend -Xms20480M -Xmx20480M 
-Dspark.driver.port=7201 -Dspark.blockManager.port=7206 
-Dspark.executor.port=7202 -Dspark.broadcast.port=7204 
-Dspark.fileserver.port=7203 -Dspark.replClassServer.port=7205 
-XX:MaxPermSize=256m
40517 Worker -Xms2048m -Xmx2048m -XX:MaxPermSize=256m
10693 Jps -Dapplication.home=/usr/lib/jvm/java-7-oracle -Xms8m

Kind regards,

Balthasar Schopman
Software Developer
LeaseWeb Technologies B.V.

T: +31 20 316 0232
M:
E: b.schop...@tech.leaseweb.com
W: http://www.leaseweb.com

Luttenbergweg 8, 1101 EC Amsterdam, Netherlands



-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Traing data sets storage requirement

2015-11-13 Thread Veluru, Aruna
Hi All,

We have just started getting hands-on with Spark, Spark Streaming and MLlib. We
are in the design phase and need suggestions on the training-data storage
requirement.

Batch Layer: Our core systems generate data which we will be using as batch
data; currently SQL Server is used by the core systems. Our requirement is to
pull data from the core databases, transform it using a Spark job and store
it in Cassandra, then train the model by pulling data from Cassandra and store
the prediction results in Cassandra itself.

Real-time Layer: We are also planning to have a real-time layer which stores
live data from devices in Cassandra for further analysis using MLlib.

We heard that there is no need for Cassandra in this design, as Spark itself
provides storage. Please advise whether Cassandra is required or not, and also
suggest the best way to handle the design (an architecture diagram was attached
as an image).




Aruna Veluru | Senior Lead Analyst | Bally 
Technologies  | (O) +1 702 532 2832 | (M) +91 99 7222 
6213

May be privileged. May be confidential. Please delete if not the addressee.





spark 1.4 GC issue

2015-11-13 Thread Renu Yadav
I am using Spark 1.4 and my application is spending much time in GC, around
60-70% of the time for each task.

I am using the parallel GC.
Could somebody please help as soon as possible.

Thanks,
Renu


Re: Stack Overflow Question

2015-11-13 Thread Sabarish Sasidharan
The reserved cores are to prevent starvation, so that user B can run jobs
when user A's job is already running and using almost all of the cluster.
You can change your scheduler configuration to use more cores.

Regards
Sab
On 13-Nov-2015 6:56 pm, "Parin Choganwala"  wrote:

> EMR 4.1.0 + Spark 1.5.0 + YARN Resource Allocation
>
> http://stackoverflow.com/q/33488869/1366507?sem=2
>


Save GraphX to disk

2015-11-13 Thread Gaurav Kumar
Hi,

I was wondering how to save a graph to disk and load it back again. I know
how to save vertices and edges to disk and construct the graph from them,
not sure if there's any method to save the graph itself to disk.
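
For reference, a hedged sketch of the vertex/edge persistence route described
above; as far as I know there is no single save-the-whole-graph call in this
Spark version. Paths and attribute types (String vertices, Int edges) are
illustrative assumptions:

import org.apache.spark.SparkContext
import org.apache.spark.graphx.{Edge, Graph, VertexId}

def saveGraph(g: Graph[String, Int], dir: String): Unit = {
  g.vertices.saveAsObjectFile(dir + "/vertices")
  g.edges.saveAsObjectFile(dir + "/edges")
}

def loadGraph(sc: SparkContext, dir: String): Graph[String, Int] = {
  // the element types must match what was saved
  val vertices = sc.objectFile[(VertexId, String)](dir + "/vertices")
  val edges = sc.objectFile[Edge[Int]](dir + "/edges")
  Graph(vertices, edges)
}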

Best Regards,
Gaurav Kumar
Big Data • Data Science • Photography • Music
+91 9953294125


Stack Overflow Question

2015-11-13 Thread Parin Choganwala
EMR 4.1.0 + Spark 1.5.0 + YARN Resource Allocation

http://stackoverflow.com/q/33488869/1366507?sem=2