spark streaming application - deployment best practices

2016-06-15 Thread vimal dinakaran
Hi All,
I am using spark-submit cluster mode deployment for my application to run
it in production.

But this requires the jars to be present at the same path on all the nodes,
and likewise the config file that is passed as an argument. I am running
Spark in standalone mode and I don't have a Hadoop or S3 environment. This
mode of deployment is inconvenient. I could do spark-submit from one node in
client mode, but that doesn't provide high availability.
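
For illustration, a sketch of the standalone cluster-mode submission being
described (the master URL, class name, and paths are all placeholders):

# Hypothetical submission; master URL, class, and paths are placeholders.
spark-submit \
  --master spark://master-host:7077 \
  --deploy-mode cluster \
  --class com.example.MyStreamingApp \
  /opt/apps/my-streaming-app.jar /opt/apps/app.conf

Since the driver may be launched on any worker, the jar and the config file
must exist at these paths on every node, which is exactly the inconvenience
described above.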

What is the best way to deploy Spark Streaming applications in production?


Thanks

Vimal


Re: Spark SQL driver memory keeps rising

2016-06-15 Thread Mich Talebzadeh
You will need to be more specific about how you are using these parameters.

Have you looked at the Spark web UI (default port 4040) to see the jobs and
stages? The amount of shuffle is also shown there.

It also helps if you run jps on the OS and send the output of ps aux|grep <PID>
as well.

What sort of resource manager are you using? Have you looked at the YARN logs?

HTH


Dr Mich Talebzadeh



LinkedIn:
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 16 June 2016 at 03:23, Mohammed Guller  wrote:

> It would be hard to guess what could be going on without looking at the
> code. It looks like the driver program goes into a long stop-the-world GC
> pause. This should not happen on the machine running the driver program if
> all you are doing is reading data from HDFS, performing a bunch of
> transformations, and writing the results back into HDFS.
>
>
>
> Perhaps, the program is not actually using Spark in cluster mode, but
> running Spark in local mode?
>
>
>
> Mohammed
>
> Author: Big Data Analytics with Spark
> 
>
>
>
> *From:* Khaled Hammouda [mailto:khaled.hammo...@kik.com]
> *Sent:* Tuesday, June 14, 2016 10:23 PM
> *To:* user
> *Subject:* Spark SQL driver memory keeps rising
>
>
>
> I'm having trouble with a Spark SQL job in which I run a series of SQL
> transformations on data loaded from HDFS.
>
>
>
> The first two stages load data from hdfs input without issues, but later
> stages that require shuffles cause the driver memory to keep rising until
> it is exhausted, and then the driver stalls, the Spark UI stops responding,
> and I can't even kill the driver with ^C; I have to forcibly kill the
> process.
>
>
>
> I think I'm allocating enough memory to the driver: driver memory is 44
> GB, and spark.driver.memoryOverhead is 4.5 GB. When I look at the memory
> usage, the driver memory before the shuffle starts is at about 2.4 GB
> (virtual mem size for the driver process is about 50 GB), and then once the
> stages that require shuffle start I can see the driver memory rising fast
> to about 47 GB, then everything stops responding.
>
>
>
> I'm not invoking any output operation that collects data at the driver. I
> just call .cache() on a couple of dataframes since they get used more than
> once in the SQL transformations, but those should be cached on the workers.
> Then I write the final result to a parquet file, but the job doesn't get to
> this final stage.
>
>
>
> What could possibly be causing the driver memory to rise that fast when no
> data is being collected at the driver?
>
>
>
> Thanks,
>
> Khaled
>


How to deal with tasks running too long?

2016-06-15 Thread Utkarsh Sengar
This SO question was asked about a year ago:
http://stackoverflow.com/questions/31799755/how-to-deal-with-tasks-running-too-long-comparing-to-others-in-job-in-yarn-cli

I answered this question with a suggestion to try speculation, but it
doesn't quite do what the OP expects. I have been running into this issue
more often these days. Out of 5000 tasks, 4950 complete in 5 minutes, but the
last 50 never really complete; I have tried waiting for 4 hours. This could
be a memory issue, or maybe the way Spark's fine-grained mode works with
Mesos; I am trying to enable JmxSink to get a heap dump.

But in the meantime, is there a better fix for this (in any version of
Spark; I am using 1.5.1 but can upgrade)? It would be great if the last 50
tasks in my example could be killed (timed out) so that the stage completes
successfully.
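
For reference, a sketch of enabling speculation at submit time (the values
shown are Spark's documented defaults, purely illustrative). Note that
speculation launches duplicate copies of slow tasks rather than killing them,
which is why it doesn't quite do what the OP wants:

# Values below are the documented defaults, shown only for illustration.
spark-submit \
  --conf spark.speculation=true \
  --conf spark.speculation.quantile=0.75 \
  --conf spark.speculation.multiplier=1.5 \
  ...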

-- 
Thanks,
-Utkarsh


Re: Spark Memory Error - Not enough space to cache broadcast

2016-06-15 Thread Cassa L
Hi,
 I did set --driver-memory 4G. I still run into this issue after 1 hour of
data load.

I also tried version 1.6 in a test environment. I hit this issue much faster
there than in the 1.5.1 setup.
LCassa

On Tue, Jun 14, 2016 at 3:57 PM, Gaurav Bhatnagar 
wrote:

> try setting the option --driver-memory 4G
>
> On Tue, Jun 14, 2016 at 3:52 PM, Ben Slater 
> wrote:
>
>> A high level shot in the dark but in our testing we found Spark 1.6 a lot
>> more reliable in low memory situations (presumably due to
>> https://issues.apache.org/jira/browse/SPARK-1). If it’s an option,
>> probably worth a try.
>>
>> Cheers
>> Ben
>>
>> On Wed, 15 Jun 2016 at 08:48 Cassa L  wrote:
>>
>>> Hi,
>>> I would appreciate any clue on this. It has become a bottleneck for our
>>> spark job.
>>>
>>> On Mon, Jun 13, 2016 at 2:56 PM, Cassa L  wrote:
>>>
 Hi,

 I'm using Spark version 1.5.1. I am reading data from Kafka into Spark and 
 writing it into Cassandra after processing. The Spark job starts fine and 
 runs well for some time, until I start getting the errors below. Once these 
 errors appear, the job starts to lag behind, and I see scheduling and 
 processing delays in the streaming UI.

 Worker memory is 6GB and executor memory is 5GB; I also tried to tweak the 
 memoryFraction parameters. Nothing works.


 16/06/13 21:26:02 INFO MemoryStore: ensureFreeSpace(4044) called with 
 curMem=565394, maxMem=2778495713
 16/06/13 21:26:02 INFO MemoryStore: Block broadcast_69652_piece0 stored as 
 bytes in memory (estimated size 3.9 KB, free 2.6 GB)
 16/06/13 21:26:02 INFO TorrentBroadcast: Reading broadcast variable 69652 
 took 2 ms
 16/06/13 21:26:02 WARN MemoryStore: Failed to reserve initial memory 
 threshold of 1024.0 KB for computing block broadcast_69652 in memory.
 16/06/13 21:26:02 WARN MemoryStore: Not enough space to cache 
 broadcast_69652 in memory! (computed 496.0 B so far)
 16/06/13 21:26:02 INFO MemoryStore: Memory use = 556.1 KB (blocks) + 2.6 
 GB (scratch space shared across 0 tasks(s)) = 2.6 GB. Storage limit = 2.6 
 GB.
 16/06/13 21:26:02 WARN MemoryStore: Persisting block broadcast_69652 to 
 disk instead.
 16/06/13 21:26:02 INFO BlockManager: Found block rdd_100761_1 locally
 16/06/13 21:26:02 INFO Executor: Finished task 0.0 in stage 71577.0 (TID 
 452316). 2043 bytes result sent to driver


 Thanks,

 L


>>> --
>> 
>> Ben Slater
>> Chief Product Officer
>> Instaclustr: Cassandra + Spark - Managed | Consulting | Support
>> +61 437 929 798
>>
>
>


Re: Is that possible to feed web request via spark application directly?

2016-06-15 Thread Peyman Mohajerian
There are a variety of REST API services you can use, but you should consider
carefully whether it makes sense to start a Spark job per individual request.
If, instead, you mean starting a Spark job based on some triggering event,
then a RESTful service makes sense. Whether your Spark cluster is multi-tenant
depends on the scheduler you use, the cluster size, and other factors. But you
seem to be mixing terminology: a given application can certainly generate many
tasks, and you can deploy many applications on a single Spark cluster; whether
you run them all concurrently is where the question of multi-tenancy comes
into the picture.

On Wed, Jun 15, 2016 at 8:19 PM, Yu Wei  wrote:

> Hi,
>
> I'm learning spark recently. I have one question about spark.
>
>
> Is it possible to feed web requests via spark application directly? Is
> there any library to be used?
>
> Or do I need to write the results from spark to HDFS/HBase?
>
>
> Is one spark application only to be designed to implement one single task?
> Or could multiple tasks be combined into one application?
>
>
>
> Thanks,
>
> Jared
>
>
>


Re: Adding h5 files in a zip to use with PySpark

2016-06-15 Thread Ashwin Raaghav
Thanks! That worked! :)

And to read the files, I used the pyspark.SparkFiles module.
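
For anyone finding this thread later, a minimal sketch of the pattern,
assuming a file named model.h5 shipped with --files (the file and module
names here are hypothetical):

# Submitted with: spark-submit --files model.h5 --py-files modules.zip app.py
# ("model.h5", "modules.zip", and "app.py" are hypothetical names.)
from pyspark import SparkContext, SparkFiles

sc = SparkContext(appName="h5-example")

def resolve_model_path(_):
    # SparkFiles.get returns the local path of a file distributed via
    # --files (or sc.addFile) on whichever worker the task runs on.
    return SparkFiles.get("model.h5")

print(sc.parallelize(range(2)).map(resolve_model_path).collect())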


On Thu, Jun 16, 2016 at 7:12 AM, Sun Rui  wrote:

> have you tried
> --files ?
> > On Jun 15, 2016, at 18:50, ar7  wrote:
> >
> > I am using PySpark 1.6.1 for my spark application. I have additional
> modules
> > which I am loading using the argument --py-files. I also have a h5 file
> > which I need to access from one of the modules for initializing the
> > ApolloNet.
> >
> > Is there any way I could access those files from the modules if I put
> them
> > in the same archive? I tried this approach but it was throwing an error
> > because the files are not there in every worker. I can think of one
> solution
> > which is copying the file to each of the workers but I want to know if
> there
> > are better ways to do it?
> >
> >
> >
> > --
> > View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Adding-h5-files-in-a-zip-to-use-with-PySpark-tp27173.html
> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
> >
> > -
> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> > For additional commands, e-mail: user-h...@spark.apache.org
> >
>
>
>


-- 
Regards,

Ashwin Raaghav


Re: Spark 2.0 Preview After caching query didn't work and can't kill job.

2016-06-15 Thread Chanh Le
Hi everyone,
I added more logs for my use case:

When I cached all my data (500 million records) and ran a count,
I received this:
16/06/16 10:09:25 ERROR TaskSetManager: Total size of serialized results of 27 
tasks (1876.7 MB) is bigger than spark.driver.maxResultSize (1024.0 MB)
>>> that's weird, because I am only running a count.
After increasing maxResultSize to 10g,
I am still waiting a long time for the result, and then get this error:
16/06/16 10:09:25 INFO BlockManagerInfo: Removed taskresult_94 on slave1:27743 
in memory (size: 69.5 MB, free: 6.2 GB)
org.apache.spark.SparkException: Job aborted due to stage failure: Total size 
of serialized results of 15 tasks (1042.6 MB) is bigger than 
spark.driver.maxResultSize (1024.0 MB
)
  at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1450)
  at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1438)
  at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1437)
  at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
  at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1437)
  at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
  at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
  at scala.Option.foreach(Option.scala:257)
  at 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:811)
  at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1659)
  at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1618)
  at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1607)
  at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
  at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:632)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:1863)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:1876)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:1889)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:1903)
  at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:883)
  at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
  at org.apache.spark.rdd.RDD.withScope(RDD.scala:357)
  at org.apache.spark.rdd.RDD.collect(RDD.scala:882)
  at 
org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:290)
  at 
org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$execute$1$1.apply(Dataset.scala:2122)
  at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
  at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2436)
  at 
org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$execute$1(Dataset.scala:2121)
  at 
org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collect(Dataset.scala:2128)
  at org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2156)
  at org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2155)
  at org.apache.spark.sql.Dataset.withCallback(Dataset.scala:2449)
  at org.apache.spark.sql.Dataset.count(Dataset.scala:2155)
  ... 48 elided

I lost all my executors.
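
For reference, spark.driver.maxResultSize is the limit being hit above; a
sketch of raising it at submit time (the value is illustrative):

# Illustrative value; per the Spark docs, 0 removes the limit entirely,
# at the risk of an OutOfMemoryError in the driver.
spark-submit \
  --conf spark.driver.maxResultSize=10g \
  ...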



> On Jun 15, 2016, at 8:44 PM, Chanh Le  wrote:
> 
> Hi Gene,
> I am using Alluxio 1.1.0.
> Spark 2.0 Preview version. 
> I load from Alluxio, then cache, and query a second time; Spark gets stuck.
> 
> 
> 
>> On Jun 15, 2016, at 8:42 PM, Gene Pang > > wrote:
>> 
>> Hi,
>> 
>> Which version of Alluxio are you using?
>> 
>> Thanks,
>> Gene
>> 
>> On Tue, Jun 14, 2016 at 3:45 AM, Chanh Le > > wrote:
>> I am testing Spark 2.0
>> I load data from Alluxio and cache it, then I query. The first query is OK 
>> because it kicks off the cache action. But when I run the query again, it 
>> gets stuck.
>> I ran in cluster 5 nodes in spark-shell.
>> 
>> Did anyone has this issue?
>> 
>> 
>> 
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org 
>> 
>> For additional commands, e-mail: user-h...@spark.apache.org 
>> 
>> 
>> 
> 



Is that possible to feed web request via spark application directly?

2016-06-15 Thread Yu Wei
Hi,

I'm learning spark recently. I have one question about spark.


Is it possible to feed web requests via spark application directly? Is there 
any library to be used?

Or do I need to write the results from spark to HDFS/HBase?


Is a Spark application designed to implement only a single task? Or 
could multiple tasks be combined into one application?



Thanks,

Jared



Re: HIVE Query 25x faster than SPARK Query

2016-06-15 Thread Gourav Sengupta
Hi Mahender,

please ensure that for dimension tables you are enabling the broadcast
method; you should see surprising gains, around 12x.
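
For illustration, a minimal sketch of hinting such a broadcast in Spark 1.6
Scala, using the A/B tables from the query below (assumes sqlContext is in
scope, as in spark-shell):

import org.apache.spark.sql.functions.broadcast

// A is the large fact table; B is the small (2-column) dimension table.
val a = sqlContext.table("A")
val b = sqlContext.table("B")

// broadcast() hints Spark SQL to ship the small table to every executor
// instead of shuffling the large one.
val joined = a.join(broadcast(b), a("PK") === b("FK"))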

Overall, I think the issue is that SPARK cannot figure out whether to scan
all the columns in a table or just the ones which are actually being used.

When you start using HIVE with ORC and TEZ you will see some amazing
results that leave SPARK way, way behind. So, pretty much, you need to have
your data in memory to match the performance claims of SPARK, and the
advantage you get in that case comes not from SPARK's algorithms but simply
from fast I/O out of RAM. The advantage of SPARK is that it brings
analytics, querying, and streaming frameworks together in an accessible way.


If you follow the optimisations mentioned in this link, you hardly have any
reason to use SPARK SQL:
http://hortonworks.com/blog/5-ways-make-hive-queries-run-faster/ . And
imagine being able to do all of that without machines that require huge
amounts of RAM; in short, you achieve those performance gains on the
commodity, low-cost systems around which HADOOP was designed.

I think that Hortonworks is offering stiff competition here :)

Regards,
Gourav Sengupta

On Wed, Jun 15, 2016 at 11:35 PM, Mahender Sarangam <
mahender.bigd...@outlook.com> wrote:

> +1.
>
> We also see performance degradation when comparing Spark SQL with Hive.
> We have a table of 260 columns and executed the same query in Hive and in
> Spark: Hive takes 66 seconds for 1 GB of data, whereas Spark takes 4
> minutes.
> On 6/9/2016 3:19 PM, Gavin Yue wrote:
>
> Could you print out the sql execution plan? My guess is about broadcast
> join.
>
>
>
> On Jun 9, 2016, at 07:14, Gourav Sengupta < 
> gourav.sengu...@gmail.com> wrote:
>
> Hi,
>
> Query1 is almost 25x faster in HIVE than in SPARK. What is happening here,
> and is there a way we can optimize the queries in SPARK without the obvious
> hack in Query2?
>
>
> ---
> ENVIRONMENT:
> ---
>
> > Table A 533 columns x 24 million rows and Table B has 2 columns x 3
> million rows. Both the files are single gzipped csv file.
> > Both table A and B are external tables in AWS S3 and created in HIVE
> accessed through SPARK using HiveContext
> > EMR 4.6, Spark 1.6.1 and Hive 1.0.0 (clusters started using
> allowMaximumResource allocation and node types are c3.4xlarge).
>
> --
> QUERY1:
> --
> select A.PK, B.FK
> from A
> left outer join B on (A.PK = B.FK)
> where B.FK is not null;
>
>
>
> This query takes 4 mins in HIVE and 1.1 hours in SPARK
>
>
> --
> QUERY 2:
> --
>
> select A.PK, B.FK
> from (select PK from A) A
> left outer join B on (A.PK = B.FK)
> where B.FK is not null;
>
> This query takes 4.5 mins in SPARK
>
>
>
> Regards,
> Gourav Sengupta
>
>
>
>
>


RE: Spark SQL driver memory keeps rising

2016-06-15 Thread Mohammed Guller
It would be hard to guess what could be going on without looking at the code. 
It looks like the driver program goes into a long stop-the-world GC pause. This 
should not happen on the machine running the driver program if all you are 
doing is reading data from HDFS, performing a bunch of transformations, and 
writing the results back into HDFS.

Perhaps the program is not actually using Spark in cluster mode, but is running 
Spark in local mode?
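
(For what it's worth, a quick way to check: print the master the context is
actually using, assuming a live SparkContext named sc:)

// Prints e.g. "local[*]" if the job is accidentally running in local mode,
// in which case all work happens inside the driver JVM.
println(sc.master)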

Mohammed
Author: Big Data Analytics with 
Spark

From: Khaled Hammouda [mailto:khaled.hammo...@kik.com]
Sent: Tuesday, June 14, 2016 10:23 PM
To: user
Subject: Spark SQL driver memory keeps rising

I'm having trouble with a Spark SQL job in which I run a series of SQL 
transformations on data loaded from HDFS.

The first two stages load data from hdfs input without issues, but later stages 
that require shuffles cause the driver memory to keep rising until it is 
exhausted, and then the driver stalls, the Spark UI stops responding, and I 
can't even kill the driver with ^C; I have to forcibly kill the process.

I think I'm allocating enough memory to the driver: driver memory is 44 GB, and 
spark.driver.memoryOverhead is 4.5 GB. When I look at the memory usage, the 
driver memory before the shuffle starts is at about 2.4 GB (virtual mem size 
for the driver process is about 50 GB), and then once the stages that require 
shuffle start I can see the driver memory rising fast to about 47 GB, then 
everything stops responding.

I'm not invoking any output operation that collects data at the driver. I just 
call .cache() on a couple of dataframes since they get used more than once in 
the SQL transformations, but those should be cached on the workers. Then I 
write the final result to a parquet file, but the job doesn't get to this final 
stage.

What could possibly be causing the driver memory to rise that fast when no data 
is being collected at the driver?

Thanks,
Khaled


Re: streaming example has error

2016-06-15 Thread Lee Ho Yeung
I got another error: "StreamingContext: Error starting the context, marking it
as stopped".

/home/martin/Downloads/spark-1.6.1/bin/spark-shell --packages
com.databricks:spark-csv_2.11:1.4.0
import org.apache.spark._
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._
val conf = new
SparkConf().setMaster("local[2]").setAppName("NetworkWordCount").set("spark.driver.allowMultipleContexts",
"true")
val ssc = new StreamingContext(conf, Seconds(1))
val lines = ssc.socketTextStream("localhost", 9999) // port lost in the archive; 9999 as in the standard example
val words = lines.flatMap(_.split(" "))
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)
ssc.start()
ssc.awaitTermination()



scala> val pairs = words.map(word => (word, 1))
pairs: org.apache.spark.streaming.dstream.DStream[(String, Int)] =
org.apache.spark.streaming.dstream.MappedDStream@61a5e7

scala> val wordCounts = pairs.reduceByKey(_ + _)
wordCounts: org.apache.spark.streaming.dstream.DStream[(String, Int)] =
org.apache.spark.streaming.dstream.ShuffledDStream@a522f1

scala> ssc.start()
16/06/15 19:14:10 ERROR StreamingContext: Error starting the context,
marking it as stopped
java.lang.IllegalArgumentException: requirement failed: No output
operations registered, so nothing to execute
at scala.Predef$.require(Predef.scala:233)
at org.apache.spark.streaming.DStreamGraph.validate(DStreamGraph.scala:161)
at
org.apache.spark.streaming.StreamingContext.validate(StreamingContext.scala:542)
at
org.apache.spark.streaming.StreamingContext.liftedTree1$1(StreamingContext.scala:601)
at
org.apache.spark.streaming.StreamingContext.start(StreamingContext.scala:600)
at
$line42.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:39)
at
$line42.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:44)
at
$line42.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:46)
at
$line42.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:48)
at
$line42.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:50)
at
$line42.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:52)
at
$line42.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:54)
at $line42.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:56)
at $line42.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:58)
at $line42.$read$$iwC$$iwC$$iwC$$iwC$$iwC.(:60)
at $line42.$read$$iwC$$iwC$$iwC$$iwC.(:62)
at $line42.$read$$iwC$$iwC$$iwC.(:64)
at $line42.$read$$iwC$$iwC.(:66)
at $line42.$read$$iwC.(:68)
at $line42.$read.(:70)
at $line42.$read$.(:74)
at $line42.$read$.()
at $line42.$eval$.(:7)
at $line42.$eval$.()
at $line42.$eval.$print()
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at
org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
at
org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1346)
at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857)
at
org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814)
at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:657)
at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:665)
at org.apache.spark.repl.SparkILoop.org
$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:670)
at
org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:997)
at
org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
at
org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
at
scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
at org.apache.spark.repl.SparkILoop.org
$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:945)
at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1059)
at org.apache.spark.repl.Main$.main(Main.scala:31)
at org.apache.spark.repl.Main.main(Main.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
at 
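
For what it's worth, the error above ("No output operations registered, so
nothing to execute") is raised because the snippet never registers an output
operation on the DStream graph. A minimal sketch of the fix is to add one,
e.g. print(), before starting the context:

// Register an output operation so the DStream graph has something to run.
wordCounts.print()

ssc.start()
ssc.awaitTermination()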

Re: GraphX performance and settings

2016-06-15 Thread Deepak Goel
I am not an expert, but here are some thoughts inline.

On Jun 16, 2016 6:31 AM, "Maja Kabiljo"  wrote:
>
> Hi,
>
> We are running some experiments with GraphX in order to compare it with
other systems. There are multiple settings which significantly affect
performance, and we experimented a lot in order to tune them well. I'll
share the best ones we found so far and the results we got with them, and
would really appreciate it if anyone who has used GraphX before has advice
on what else can make it even better, or can confirm that these results are
as good as it gets.
>
> Algorithms we used are pagerank and connected components. We used Twitter
and UK graphs from the GraphX paper (
https://amplab.cs.berkeley.edu/wp-content/uploads/2014/09/graphx.pdf), and
also generated graphs with properties similar to the Facebook social graph
with various numbers of edges. Apart from performance, we tried to see the
minimum amount of resources required to handle a graph of a given size.
>
> We ran experiments using Spark 1.6.1, on machines which have 20 cores
with 2-way SMT, always fixing number of executors (min=max=initial), giving
40GB or 80GB per executor, and making sure we run only a single executor
per machine.

***Deepak***
I guess you have 16 machines in your test. Is that right?
**Deepak***

Additionally we used:
> spark.shuffle.manager=hash, spark.shuffle.service.enabled=false
> Parallel GC
> PartitionStrategy.EdgePartition2D
> 8*numberOfExecutors partitions
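
(For illustration, the partitioning settings above correspond roughly to the
following GraphX code; a sketch with a placeholder path and executor count:)

import org.apache.spark.graphx.{GraphLoader, PartitionStrategy}

val numExecutors = 4  // placeholder executor count
val graph = GraphLoader
  .edgeListFile(sc, "hdfs:///path/to/edges",  // placeholder path
    numEdgePartitions = 8 * numExecutors)
  .partitionBy(PartitionStrategy.EdgePartition2D)
val ranks = graph.staticPageRank(20).vertices  // 20 pagerank iterations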
> Here are some data points which we got:
> Running on Facebook-like graph with 2 billion edges, using 4 executors
with 80GB each it took 451 seconds to do 20 iterations of pagerank and 236
seconds to find connected components. It failed when we tried to use 2
executors, or 4 executors with 40GB each.
> For graph with 10 billion edges we needed 16 executors with 80GB each (it
failed with 8), 1041 seconds for 20 iterations of pagerank and 716 seconds
for connected component

**Deepak*
The executors are not scaling linearly; you should need at most 10
executors. Also, what error does it show with 8 executors?
*Deepak**

> Twitter-2010 graph (1.5 billion edges), 8 executors, 40GB each, pagerank
473s, connected components 264s. With 4 executors 80GB each it worked but
was struggling (pr 2475s, cc 4499s), with 8 executors 80GB pr 362s, cc 255s.

*Deepak*
For 4 executors, can you try 160GB? Also, if you could share the system
statistics during the test, that would be great. My guess is that with 4
executors a lot of spilling is happening.
*Deepak***

> One more thing, we were not able to reproduce what’s mentioned in the
paper about fault tolerance (section 5.2). If we kill an executor during
first few iterations it recovers successfully, but if killed in later
iterations reconstruction of each iteration starts taking exponentially
longer and doesn’t finish after letting it run for a few hours. Are there
some additional parameters which we need to set in order for this to work?
>
> Any feedback would be highly appreciated!
>
> Thank you,
> Maja


Re: java server error - spark

2016-06-15 Thread spR
hey,

Thanks. Now it worked.. :)

On Wed, Jun 15, 2016 at 6:59 PM, Jeff Zhang  wrote:

> Then the only solution is to increase your driver memory but still
> restricted by your machine's memory.  "--driver-memory"
>
> On Thu, Jun 16, 2016 at 9:53 AM, spR  wrote:
>
>> Hey,
>>
>> But I just have one machine. I am running everything on my laptop. Won't
>> I be able to do this processing in local mode then?
>>
>> Regards,
>> Tejaswini
>>
>> On Wed, Jun 15, 2016 at 6:32 PM, Jeff Zhang  wrote:
>>
>>> You are using local mode, --executor-memory  won't take effect for
>>> local mode, please use other cluster mode.
>>>
>>> On Thu, Jun 16, 2016 at 9:32 AM, Jeff Zhang  wrote:
>>>
 Specify --executor-memory in your spark-submit command.



 On Thu, Jun 16, 2016 at 9:01 AM, spR  wrote:

> Thank you. Can you pls tell How to increase the executor memory?
>
>
>
> On Wed, Jun 15, 2016 at 5:59 PM, Jeff Zhang  wrote:
>
>> >>> Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
>>
>>
>> It is OOM on the executor.  Please try to increase executor memory.
>> "--executor-memory"
>>
>>
>>
>>
>>
>> On Thu, Jun 16, 2016 at 8:54 AM, spR  wrote:
>>
>>> Hey,
>>>
>>> error trace -
>>>
>>> hey,
>>>
>>>
>>> error trace -
>>>
>>>
>>> ---Py4JJavaError
>>>  Traceback (most recent call 
>>> last) in ()> 1 temp.take(2)
>>>
>>> /Users/my/Documents/My_Study_folder/spark-1.6.1/python/pyspark/sql/dataframe.pyc
>>>  in take(self, num)304 with SCCallSiteSync(self._sc) as 
>>> css:305 port = 
>>> self._sc._jvm.org.apache.spark.sql.execution.EvaluatePython.takeAndServe(-->
>>>  306 self._jdf, num)307 return 
>>> list(_load_from_socket(port, BatchedSerializer(PickleSerializer(
>>> 308
>>>
>>> /Users/my/Documents/My_Study_folder/spark-1.6.1/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py
>>>  in __call__(self, *args)811 answer = 
>>> self.gateway_client.send_command(command)812 return_value = 
>>> get_return_value(--> 813 answer, self.gateway_client, 
>>> self.target_id, self.name)814
>>> 815 for temp_arg in temp_args:
>>>
>>> /Users/my/Documents/My_Study_folder/spark-1.6.1/python/pyspark/sql/utils.pyc
>>>  in deco(*a, **kw) 43 def deco(*a, **kw): 44 
>>> try:---> 45 return f(*a, **kw) 46 except 
>>> py4j.protocol.Py4JJavaError as e: 47 s = 
>>> e.java_exception.toString()
>>> /Users/my/Documents/My_Study_folder/spark-1.6.1/python/lib/py4j-0.9-src.zip/py4j/protocol.py
>>>  in get_return_value(answer, gateway_client, target_id, name)306
>>>  raise Py4JJavaError(307 "An error 
>>> occurred while calling {0}{1}{2}.\n".--> 308 
>>> format(target_id, ".", name), value)309 else:
>>> 310 raise Py4JError(
>>> Py4JJavaError: An error occurred while calling 
>>> z:org.apache.spark.sql.execution.EvaluatePython.takeAndServe.
>>> : org.apache.spark.SparkException: Job aborted due to stage failure: 
>>> Task 0 in stage 3.0 failed 1 times, most recent failure: Lost task 0.0 
>>> in stage 3.0 (TID 76, localhost): java.lang.OutOfMemoryError: GC 
>>> overhead limit exceeded
>>> at com.mysql.jdbc.MysqlIO.nextRowFast(MysqlIO.java:2205)
>>> at com.mysql.jdbc.MysqlIO.nextRow(MysqlIO.java:1984)
>>> at com.mysql.jdbc.MysqlIO.readSingleRowSet(MysqlIO.java:3403)
>>> at com.mysql.jdbc.MysqlIO.getResultSet(MysqlIO.java:470)
>>> at 
>>> com.mysql.jdbc.MysqlIO.readResultsForQueryOrUpdate(MysqlIO.java:3105)
>>> at com.mysql.jdbc.MysqlIO.readAllResults(MysqlIO.java:2336)
>>> at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2729)
>>> at 
>>> com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2549)
>>> at 
>>> com.mysql.jdbc.PreparedStatement.executeInternal(PreparedStatement.java:1861)
>>> at 
>>> com.mysql.jdbc.PreparedStatement.executeQuery(PreparedStatement.java:1962)
>>> at 
>>> org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anon$1.(JDBCRDD.scala:363)
>>> at 
>>> org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD.compute(JDBCRDD.scala:339)
>>> at 
>>> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>>> at 

Re: java server error - spark

2016-06-15 Thread Jeff Zhang
Then the only solution is to increase your driver memory ("--driver-memory"),
but it is still restricted by your machine's physical memory.

On Thu, Jun 16, 2016 at 9:53 AM, spR  wrote:

> Hey,
>
> But I just have one machine. I am running everything on my laptop. Won't I
> be able to do this processing in local mode then?
>
> Regards,
> Tejaswini
>
> On Wed, Jun 15, 2016 at 6:32 PM, Jeff Zhang  wrote:
>
>> You are using local mode, --executor-memory  won't take effect for local
>> mode, please use other cluster mode.
>>
>> On Thu, Jun 16, 2016 at 9:32 AM, Jeff Zhang  wrote:
>>
>>> Specify --executor-memory in your spark-submit command.
>>>
>>>
>>>
>>> On Thu, Jun 16, 2016 at 9:01 AM, spR  wrote:
>>>
 Thank you. Can you pls tell How to increase the executor memory?



 On Wed, Jun 15, 2016 at 5:59 PM, Jeff Zhang  wrote:

> >>> Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
>
>
> It is OOM on the executor.  Please try to increase executor memory.
> "--executor-memory"
>
>
>
>
>
> On Thu, Jun 16, 2016 at 8:54 AM, spR  wrote:
>
>> Hey,
>>
>> error trace -
>>
>> hey,
>>
>>
>> error trace -
>>
>>
>> ---Py4JJavaError
>>  Traceback (most recent call 
>> last) in ()> 1 temp.take(2)
>>
>> /Users/my/Documents/My_Study_folder/spark-1.6.1/python/pyspark/sql/dataframe.pyc
>>  in take(self, num)304 with SCCallSiteSync(self._sc) as css: 
>>305 port = 
>> self._sc._jvm.org.apache.spark.sql.execution.EvaluatePython.takeAndServe(-->
>>  306 self._jdf, num)307 return 
>> list(_load_from_socket(port, BatchedSerializer(PickleSerializer(
>> 308
>>
>> /Users/my/Documents/My_Study_folder/spark-1.6.1/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py
>>  in __call__(self, *args)811 answer = 
>> self.gateway_client.send_command(command)812 return_value = 
>> get_return_value(--> 813 answer, self.gateway_client, 
>> self.target_id, self.name)814
>> 815 for temp_arg in temp_args:
>>
>> /Users/my/Documents/My_Study_folder/spark-1.6.1/python/pyspark/sql/utils.pyc
>>  in deco(*a, **kw) 43 def deco(*a, **kw): 44 
>> try:---> 45 return f(*a, **kw) 46 except 
>> py4j.protocol.Py4JJavaError as e: 47 s = 
>> e.java_exception.toString()
>> /Users/my/Documents/My_Study_folder/spark-1.6.1/python/lib/py4j-0.9-src.zip/py4j/protocol.py
>>  in get_return_value(answer, gateway_client, target_id, name)306 
>> raise Py4JJavaError(307 "An error 
>> occurred while calling {0}{1}{2}.\n".--> 308 
>> format(target_id, ".", name), value)309 else:
>> 310 raise Py4JError(
>> Py4JJavaError: An error occurred while calling 
>> z:org.apache.spark.sql.execution.EvaluatePython.takeAndServe.
>> : org.apache.spark.SparkException: Job aborted due to stage failure: 
>> Task 0 in stage 3.0 failed 1 times, most recent failure: Lost task 0.0 
>> in stage 3.0 (TID 76, localhost): java.lang.OutOfMemoryError: GC 
>> overhead limit exceeded
>>  at com.mysql.jdbc.MysqlIO.nextRowFast(MysqlIO.java:2205)
>>  at com.mysql.jdbc.MysqlIO.nextRow(MysqlIO.java:1984)
>>  at com.mysql.jdbc.MysqlIO.readSingleRowSet(MysqlIO.java:3403)
>>  at com.mysql.jdbc.MysqlIO.getResultSet(MysqlIO.java:470)
>>  at com.mysql.jdbc.MysqlIO.readResultsForQueryOrUpdate(MysqlIO.java:3105)
>>  at com.mysql.jdbc.MysqlIO.readAllResults(MysqlIO.java:2336)
>>  at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2729)
>>  at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2549)
>>  at 
>> com.mysql.jdbc.PreparedStatement.executeInternal(PreparedStatement.java:1861)
>>  at 
>> com.mysql.jdbc.PreparedStatement.executeQuery(PreparedStatement.java:1962)
>>  at 
>> org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anon$1.(JDBCRDD.scala:363)
>>  at 
>> org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD.compute(JDBCRDD.scala:339)
>>  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>>  at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>>  at 
>> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>>  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>>  at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>>  at 
>> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>>  at 

Re: java server error - spark

2016-06-15 Thread spR
Hey,

But I just have one machine. I am running everything on my laptop. Won't I
be able to do this processing in local mode then?

Regards,
Tejaswini

On Wed, Jun 15, 2016 at 6:32 PM, Jeff Zhang  wrote:

> You are using local mode, --executor-memory  won't take effect for local
> mode, please use other cluster mode.
>
> On Thu, Jun 16, 2016 at 9:32 AM, Jeff Zhang  wrote:
>
>> Specify --executor-memory in your spark-submit command.
>>
>>
>>
>> On Thu, Jun 16, 2016 at 9:01 AM, spR  wrote:
>>
>>> Thank you. Can you pls tell How to increase the executor memory?
>>>
>>>
>>>
>>> On Wed, Jun 15, 2016 at 5:59 PM, Jeff Zhang  wrote:
>>>
 >>> Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded


 It is OOM on the executor.  Please try to increase executor memory.
 "--executor-memory"





 On Thu, Jun 16, 2016 at 8:54 AM, spR  wrote:

> Hey,
>
> error trace -
>
> hey,
>
>
> error trace -
>
>
> ---Py4JJavaError
>  Traceback (most recent call 
> last) in ()> 1 temp.take(2)
>
> /Users/my/Documents/My_Study_folder/spark-1.6.1/python/pyspark/sql/dataframe.pyc
>  in take(self, num)304 with SCCallSiteSync(self._sc) as css:  
>   305 port = 
> self._sc._jvm.org.apache.spark.sql.execution.EvaluatePython.takeAndServe(-->
>  306 self._jdf, num)307 return 
> list(_load_from_socket(port, BatchedSerializer(PickleSerializer(
> 308
>
> /Users/my/Documents/My_Study_folder/spark-1.6.1/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py
>  in __call__(self, *args)811 answer = 
> self.gateway_client.send_command(command)812 return_value = 
> get_return_value(--> 813 answer, self.gateway_client, 
> self.target_id, self.name)814
> 815 for temp_arg in temp_args:
>
> /Users/my/Documents/My_Study_folder/spark-1.6.1/python/pyspark/sql/utils.pyc
>  in deco(*a, **kw) 43 def deco(*a, **kw): 44 try:---> 
> 45 return f(*a, **kw) 46 except 
> py4j.protocol.Py4JJavaError as e: 47 s = 
> e.java_exception.toString()
> /Users/my/Documents/My_Study_folder/spark-1.6.1/python/lib/py4j-0.9-src.zip/py4j/protocol.py
>  in get_return_value(answer, gateway_client, target_id, name)306  
>raise Py4JJavaError(307 "An error 
> occurred while calling {0}{1}{2}.\n".--> 308 
> format(target_id, ".", name), value)309 else:
> 310 raise Py4JError(
> Py4JJavaError: An error occurred while calling 
> z:org.apache.spark.sql.execution.EvaluatePython.takeAndServe.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 
> 0 in stage 3.0 failed 1 times, most recent failure: Lost task 0.0 in 
> stage 3.0 (TID 76, localhost): java.lang.OutOfMemoryError: GC overhead 
> limit exceeded
>   at com.mysql.jdbc.MysqlIO.nextRowFast(MysqlIO.java:2205)
>   at com.mysql.jdbc.MysqlIO.nextRow(MysqlIO.java:1984)
>   at com.mysql.jdbc.MysqlIO.readSingleRowSet(MysqlIO.java:3403)
>   at com.mysql.jdbc.MysqlIO.getResultSet(MysqlIO.java:470)
>   at com.mysql.jdbc.MysqlIO.readResultsForQueryOrUpdate(MysqlIO.java:3105)
>   at com.mysql.jdbc.MysqlIO.readAllResults(MysqlIO.java:2336)
>   at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2729)
>   at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2549)
>   at 
> com.mysql.jdbc.PreparedStatement.executeInternal(PreparedStatement.java:1861)
>   at 
> com.mysql.jdbc.PreparedStatement.executeQuery(PreparedStatement.java:1962)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anon$1.(JDBCRDD.scala:363)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD.compute(JDBCRDD.scala:339)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>   at org.apache.spark.scheduler.Task.run(Task.scala:89)
>   at 

Re: Adding h5 files in a zip to use with PySpark

2016-06-15 Thread Sun Rui
have you tried
--files ?
> On Jun 15, 2016, at 18:50, ar7  wrote:
> 
> I am using PySpark 1.6.1 for my spark application. I have additional modules
> which I am loading using the argument --py-files. I also have a h5 file
> which I need to access from one of the modules for initializing the
> ApolloNet.
> 
> Is there any way I could access those files from the modules if I put them
> in the same archive? I tried this approach but it was throwing an error
> because the files are not there in every worker. I can think of one solution
> which is copying the file to each of the workers but I want to know if there
> are better ways to do it?
> 
> 
> 
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Adding-h5-files-in-a-zip-to-use-with-PySpark-tp27173.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
> 



-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



RE: concat spark dataframes

2016-06-15 Thread Mohammed Guller
Off the top of my head, I can think of the zip operation that RDD provides. So, 
for example, if you have two DataFrames df1 and df2, you could do something like 
this:

val newDF = df1.rdd.zip(df2.rdd).map { case (rowFromDf1, rowFromDf2) =>
  /* combine the fields of both rows into one row */ }.toDF(...)

Couple of things to keep in mind:

1)  Both df1 and df2 should have the same number of rows.

2)  You are assuming that row N from df1 is related to row N from df2.
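
A runnable version of this sketch, under the assumption that the two Rows
are combined with Row.merge and the schema is rebuilt from both inputs:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.StructType

// zip additionally requires both RDDs to have the same number of
// partitions and the same number of elements per partition.
val mergedRDD = df1.rdd.zip(df2.rdd).map { case (r1, r2) => Row.merge(r1, r2) }

// The combined schema is simply the fields of df1 followed by those of df2.
val mergedSchema = StructType(df1.schema.fields ++ df2.schema.fields)

val newDF = sqlContext.createDataFrame(mergedRDD, mergedSchema)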

Mohammed
Author: Big Data Analytics with 
Spark

From: spR [mailto:data.smar...@gmail.com]
Sent: Wednesday, June 15, 2016 4:08 PM
To: Mohammed Guller
Cc: Natu Lauchande; user
Subject: Re: concat spark dataframes

Hey,

There are quite a lot of fields, but there are no common fields between the 2 
dataframes. Can I not concatenate the 2 frames the way we can in pandas, such 
that the resulting dataframe has the columns from both dataframes?

Thank you.

Regards,
Misha



On Wed, Jun 15, 2016 at 3:44 PM, Mohammed Guller 
> wrote:
Hi Misha,
What is the schema for both the DataFrames? And what is the expected schema of 
the resulting DataFrame?

Mohammed
Author: Big Data Analytics with 
Spark

From: Natu Lauchande [mailto:nlaucha...@gmail.com]
Sent: Wednesday, June 15, 2016 2:07 PM
To: spR
Cc: user
Subject: Re: concat spark dataframes

Hi,
You can select the common columns and use DataFrame.unionAll.
Regards,
Natu

On Wed, Jun 15, 2016 at 8:57 PM, spR 
> wrote:
hi,

how do I concatenate Spark dataframes? I have 2 frames with certain columns, 
and I want to get a dataframe with the columns from both of the other frames.

Regards,
Misha




Re: java server error - spark

2016-06-15 Thread Jeff Zhang
You are using local mode; --executor-memory won't take effect in local
mode. Please use a cluster mode instead.

On Thu, Jun 16, 2016 at 9:32 AM, Jeff Zhang  wrote:

> Specify --executor-memory in your spark-submit command.
>
>
>
> On Thu, Jun 16, 2016 at 9:01 AM, spR  wrote:
>
>> Thank you. Can you pls tell How to increase the executor memory?
>>
>>
>>
>> On Wed, Jun 15, 2016 at 5:59 PM, Jeff Zhang  wrote:
>>
>>> >>> Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
>>>
>>>
>>> It is OOM on the executor.  Please try to increase executor memory.
>>> "--executor-memory"
>>>
>>>
>>>
>>>
>>>
>>> On Thu, Jun 16, 2016 at 8:54 AM, spR  wrote:
>>>
 Hey,

 error trace -

 hey,


 error trace -


 ---Py4JJavaError
  Traceback (most recent call 
 last) in ()> 1 temp.take(2)

 /Users/my/Documents/My_Study_folder/spark-1.6.1/python/pyspark/sql/dataframe.pyc
  in take(self, num)304 with SCCallSiteSync(self._sc) as css:   
  305 port = 
 self._sc._jvm.org.apache.spark.sql.execution.EvaluatePython.takeAndServe(-->
  306 self._jdf, num)307 return 
 list(_load_from_socket(port, BatchedSerializer(PickleSerializer(
 308

 /Users/my/Documents/My_Study_folder/spark-1.6.1/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py
  in __call__(self, *args)811 answer = 
 self.gateway_client.send_command(command)812 return_value = 
 get_return_value(--> 813 answer, self.gateway_client, 
 self.target_id, self.name)814
 815 for temp_arg in temp_args:

 /Users/my/Documents/My_Study_folder/spark-1.6.1/python/pyspark/sql/utils.pyc
  in deco(*a, **kw) 43 def deco(*a, **kw): 44 try:---> 
 45 return f(*a, **kw) 46 except 
 py4j.protocol.Py4JJavaError as e: 47 s = 
 e.java_exception.toString()
 /Users/my/Documents/My_Study_folder/spark-1.6.1/python/lib/py4j-0.9-src.zip/py4j/protocol.py
  in get_return_value(answer, gateway_client, target_id, name)306   
   raise Py4JJavaError(307 "An error 
 occurred while calling {0}{1}{2}.\n".--> 308 
 format(target_id, ".", name), value)309 else:
 310 raise Py4JError(
 Py4JJavaError: An error occurred while calling 
 z:org.apache.spark.sql.execution.EvaluatePython.takeAndServe.
 : org.apache.spark.SparkException: Job aborted due to stage failure: Task 
 0 in stage 3.0 failed 1 times, most recent failure: Lost task 0.0 in stage 
 3.0 (TID 76, localhost): java.lang.OutOfMemoryError: GC overhead limit 
 exceeded
at com.mysql.jdbc.MysqlIO.nextRowFast(MysqlIO.java:2205)
at com.mysql.jdbc.MysqlIO.nextRow(MysqlIO.java:1984)
at com.mysql.jdbc.MysqlIO.readSingleRowSet(MysqlIO.java:3403)
at com.mysql.jdbc.MysqlIO.getResultSet(MysqlIO.java:470)
at com.mysql.jdbc.MysqlIO.readResultsForQueryOrUpdate(MysqlIO.java:3105)
at com.mysql.jdbc.MysqlIO.readAllResults(MysqlIO.java:2336)
at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2729)
at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2549)
at 
 com.mysql.jdbc.PreparedStatement.executeInternal(PreparedStatement.java:1861)
at 
 com.mysql.jdbc.PreparedStatement.executeQuery(PreparedStatement.java:1962)
at 
 org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anon$1.(JDBCRDD.scala:363)
at 
 org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD.compute(JDBCRDD.scala:339)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at 
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at 
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

 

Re: java server error - spark

2016-06-15 Thread Jeff Zhang
Specify --executor-memory in your spark-submit command.



On Thu, Jun 16, 2016 at 9:01 AM, spR  wrote:

> Thank you. Can you pls tell How to increase the executor memory?
>
>
>
> On Wed, Jun 15, 2016 at 5:59 PM, Jeff Zhang  wrote:
>
>> >>> Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
>>
>>
>> It is OOM on the executor.  Please try to increase executor memory.
>> "--executor-memory"
>>
>>
>>
>>
>>
>> On Thu, Jun 16, 2016 at 8:54 AM, spR  wrote:
>>
>>> Hey,
>>>
>>> error trace -
>>>
>>> hey,
>>>
>>>
>>> error trace -
>>>
>>>
>>> ---Py4JJavaError
>>>  Traceback (most recent call 
>>> last) in ()> 1 temp.take(2)
>>>
>>> /Users/my/Documents/My_Study_folder/spark-1.6.1/python/pyspark/sql/dataframe.pyc
>>>  in take(self, num)304 with SCCallSiteSync(self._sc) as css:
>>> 305 port = 
>>> self._sc._jvm.org.apache.spark.sql.execution.EvaluatePython.takeAndServe(-->
>>>  306 self._jdf, num)307 return 
>>> list(_load_from_socket(port, BatchedSerializer(PickleSerializer(
>>> 308
>>>
>>> /Users/my/Documents/My_Study_folder/spark-1.6.1/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py
>>>  in __call__(self, *args)811 answer = 
>>> self.gateway_client.send_command(command)812 return_value = 
>>> get_return_value(--> 813 answer, self.gateway_client, 
>>> self.target_id, self.name)814
>>> 815 for temp_arg in temp_args:
>>>
>>> /Users/my/Documents/My_Study_folder/spark-1.6.1/python/pyspark/sql/utils.pyc
>>>  in deco(*a, **kw) 43 def deco(*a, **kw): 44 try:---> 
>>> 45 return f(*a, **kw) 46 except 
>>> py4j.protocol.Py4JJavaError as e: 47 s = 
>>> e.java_exception.toString()
>>> /Users/my/Documents/My_Study_folder/spark-1.6.1/python/lib/py4j-0.9-src.zip/py4j/protocol.py
>>>  in get_return_value(answer, gateway_client, target_id, name)306
>>>  raise Py4JJavaError(307 "An error occurred 
>>> while calling {0}{1}{2}.\n".--> 308 format(target_id, 
>>> ".", name), value)309 else:
>>> 310 raise Py4JError(
>>> Py4JJavaError: An error occurred while calling 
>>> z:org.apache.spark.sql.execution.EvaluatePython.takeAndServe.
>>> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 
>>> in stage 3.0 failed 1 times, most recent failure: Lost task 0.0 in stage 
>>> 3.0 (TID 76, localhost): java.lang.OutOfMemoryError: GC overhead limit 
>>> exceeded
>>> at com.mysql.jdbc.MysqlIO.nextRowFast(MysqlIO.java:2205)
>>> at com.mysql.jdbc.MysqlIO.nextRow(MysqlIO.java:1984)
>>> at com.mysql.jdbc.MysqlIO.readSingleRowSet(MysqlIO.java:3403)
>>> at com.mysql.jdbc.MysqlIO.getResultSet(MysqlIO.java:470)
>>> at com.mysql.jdbc.MysqlIO.readResultsForQueryOrUpdate(MysqlIO.java:3105)
>>> at com.mysql.jdbc.MysqlIO.readAllResults(MysqlIO.java:2336)
>>> at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2729)
>>> at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2549)
>>> at 
>>> com.mysql.jdbc.PreparedStatement.executeInternal(PreparedStatement.java:1861)
>>> at 
>>> com.mysql.jdbc.PreparedStatement.executeQuery(PreparedStatement.java:1962)
>>> at 
>>> org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anon$1.(JDBCRDD.scala:363)
>>> at 
>>> org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD.compute(JDBCRDD.scala:339)
>>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>>> at 
>>> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>>> at 
>>> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>>> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>>> at org.apache.spark.scheduler.Task.run(Task.scala:89)
>>> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>>> at 
>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>> at 
>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>> at java.lang.Thread.run(Thread.java:745)
>>>
>>> Driver stacktrace:
>>> at org.apache.spark.scheduler.DAGScheduler.org 
>>> $apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431)
>>> at 
>>> 

Re: java server error - spark

2016-06-15 Thread spR
hey,

I did this in my notebook, but I still get the same error. Is this the
right way to do it?

from pyspark import SparkConf
conf = (SparkConf()
 .setMaster("local[4]")
 .setAppName("My app")
 .set("spark.executor.memory", "12g"))
sc.conf = conf
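
(For reference, a sketch of why this has no effect, assuming PySpark 1.6: a
SparkConf only takes effect when the SparkContext is created, and in local
mode the driver JVM's heap is fixed at launch, so memory must be set before
startup, e.g. with pyspark --driver-memory 12g:)

from pyspark import SparkConf, SparkContext

# Assigning sc.conf after the context exists changes nothing; stop the old
# context and create a new one with the conf instead.
sc.stop()
conf = (SparkConf()
        .setMaster("local[4]")
        .setAppName("My app")
        .set("spark.executor.memory", "12g"))
sc = SparkContext(conf=conf)
# Note: in local mode the executors run inside the driver JVM, so the heap
# that actually matters is the driver's, fixed when the JVM launched.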

On Wed, Jun 15, 2016 at 5:59 PM, Jeff Zhang  wrote:

> >>> Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
>
>
> It is OOM on the executor.  Please try to increase executor memory.
> "--executor-memory"
>
>
>
>
>
> On Thu, Jun 16, 2016 at 8:54 AM, spR  wrote:
>
>> Hey,
>>
>> error trace -
>>
>> hey,
>>
>>
>> error trace -
>>
>>
>> ---Py4JJavaError
>>  Traceback (most recent call 
>> last) in ()> 1 temp.take(2)
>>
>> /Users/my/Documents/My_Study_folder/spark-1.6.1/python/pyspark/sql/dataframe.pyc
>>  in take(self, num)304 with SCCallSiteSync(self._sc) as css:
>> 305 port = 
>> self._sc._jvm.org.apache.spark.sql.execution.EvaluatePython.takeAndServe(--> 
>> 306 self._jdf, num)307 return 
>> list(_load_from_socket(port, BatchedSerializer(PickleSerializer(
>> 308
>>
>> /Users/my/Documents/My_Study_folder/spark-1.6.1/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py
>>  in __call__(self, *args)811 answer = 
>> self.gateway_client.send_command(command)812 return_value = 
>> get_return_value(--> 813 answer, self.gateway_client, 
>> self.target_id, self.name)814
>> 815 for temp_arg in temp_args:
>>
>> /Users/my/Documents/My_Study_folder/spark-1.6.1/python/pyspark/sql/utils.pyc 
>> in deco(*a, **kw) 43 def deco(*a, **kw): 44 try:---> 45  
>>return f(*a, **kw) 46 except 
>> py4j.protocol.Py4JJavaError as e: 47 s = 
>> e.java_exception.toString()
>> /Users/my/Documents/My_Study_folder/spark-1.6.1/python/lib/py4j-0.9-src.zip/py4j/protocol.py
>>  in get_return_value(answer, gateway_client, target_id, name)306 
>> raise Py4JJavaError(307 "An error occurred 
>> while calling {0}{1}{2}.\n".--> 308 format(target_id, 
>> ".", name), value)309 else:
>> 310 raise Py4JError(
>> Py4JJavaError: An error occurred while calling 
>> z:org.apache.spark.sql.execution.EvaluatePython.takeAndServe.
>> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 
>> in stage 3.0 failed 1 times, most recent failure: Lost task 0.0 in stage 3.0 
>> (TID 76, localhost): java.lang.OutOfMemoryError: GC overhead limit exceeded
>>  at com.mysql.jdbc.MysqlIO.nextRowFast(MysqlIO.java:2205)
>>  at com.mysql.jdbc.MysqlIO.nextRow(MysqlIO.java:1984)
>>  at com.mysql.jdbc.MysqlIO.readSingleRowSet(MysqlIO.java:3403)
>>  at com.mysql.jdbc.MysqlIO.getResultSet(MysqlIO.java:470)
>>  at com.mysql.jdbc.MysqlIO.readResultsForQueryOrUpdate(MysqlIO.java:3105)
>>  at com.mysql.jdbc.MysqlIO.readAllResults(MysqlIO.java:2336)
>>  at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2729)
>>  at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2549)
>>  at 
>> com.mysql.jdbc.PreparedStatement.executeInternal(PreparedStatement.java:1861)
>>  at 
>> com.mysql.jdbc.PreparedStatement.executeQuery(PreparedStatement.java:1962)
>>  at 
>> org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anon$1.(JDBCRDD.scala:363)
>>  at 
>> org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD.compute(JDBCRDD.scala:339)
>>  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>>  at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>>  at 
>> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>>  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>>  at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>>  at 
>> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>>  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>>  at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>>  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>>  at org.apache.spark.scheduler.Task.run(Task.scala:89)
>>  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>>  at 
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>  at 
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>  at java.lang.Thread.run(Thread.java:745)
>>
>> Driver stacktrace:
>>  at org.apache.spark.scheduler.DAGScheduler.org 
>> $apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431)
>>  at 
>> 

Re: java server error - spark

2016-06-15 Thread spR
Thank you. Can you please tell me how to increase the executor memory?



On Wed, Jun 15, 2016 at 5:59 PM, Jeff Zhang  wrote:

> >>> Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
>
>
> It is OOM on the executor.  Please try to increase executor memory.
> "--executor-memory"
>
>
>
>
>
> On Thu, Jun 16, 2016 at 8:54 AM, spR  wrote:
>
>> Hey,
>>
>> error trace -
>>
>> hey,
>>
>>
>> error trace -
>>
>>
>> ---Py4JJavaError
>>  Traceback (most recent call 
>> last) in ()> 1 temp.take(2)
>>
>> /Users/my/Documents/My_Study_folder/spark-1.6.1/python/pyspark/sql/dataframe.pyc
>>  in take(self, num)304 with SCCallSiteSync(self._sc) as css:
>> 305 port = 
>> self._sc._jvm.org.apache.spark.sql.execution.EvaluatePython.takeAndServe(--> 
>> 306 self._jdf, num)307 return 
>> list(_load_from_socket(port, BatchedSerializer(PickleSerializer(
>> 308
>>
>> /Users/my/Documents/My_Study_folder/spark-1.6.1/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py
>>  in __call__(self, *args)811 answer = 
>> self.gateway_client.send_command(command)812 return_value = 
>> get_return_value(--> 813 answer, self.gateway_client, 
>> self.target_id, self.name)814
>> 815 for temp_arg in temp_args:
>>
>> /Users/my/Documents/My_Study_folder/spark-1.6.1/python/pyspark/sql/utils.pyc 
>> in deco(*a, **kw) 43 def deco(*a, **kw): 44 try:---> 45  
>>return f(*a, **kw) 46 except 
>> py4j.protocol.Py4JJavaError as e: 47 s = 
>> e.java_exception.toString()
>> /Users/my/Documents/My_Study_folder/spark-1.6.1/python/lib/py4j-0.9-src.zip/py4j/protocol.py
>>  in get_return_value(answer, gateway_client, target_id, name)306 
>> raise Py4JJavaError(307 "An error occurred 
>> while calling {0}{1}{2}.\n".--> 308 format(target_id, 
>> ".", name), value)309 else:
>> 310 raise Py4JError(
>> Py4JJavaError: An error occurred while calling 
>> z:org.apache.spark.sql.execution.EvaluatePython.takeAndServe.
>> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 
>> in stage 3.0 failed 1 times, most recent failure: Lost task 0.0 in stage 3.0 
>> (TID 76, localhost): java.lang.OutOfMemoryError: GC overhead limit exceeded
>>  at com.mysql.jdbc.MysqlIO.nextRowFast(MysqlIO.java:2205)
>>  at com.mysql.jdbc.MysqlIO.nextRow(MysqlIO.java:1984)
>>  at com.mysql.jdbc.MysqlIO.readSingleRowSet(MysqlIO.java:3403)
>>  at com.mysql.jdbc.MysqlIO.getResultSet(MysqlIO.java:470)
>>  at com.mysql.jdbc.MysqlIO.readResultsForQueryOrUpdate(MysqlIO.java:3105)
>>  at com.mysql.jdbc.MysqlIO.readAllResults(MysqlIO.java:2336)
>>  at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2729)
>>  at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2549)
>>  at 
>> com.mysql.jdbc.PreparedStatement.executeInternal(PreparedStatement.java:1861)
>>  at 
>> com.mysql.jdbc.PreparedStatement.executeQuery(PreparedStatement.java:1962)
>>  at 
>> org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anon$1.<init>(JDBCRDD.scala:363)
>>  at 
>> org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD.compute(JDBCRDD.scala:339)
>>  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>>  at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>>  at 
>> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>>  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>>  at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>>  at 
>> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>>  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>>  at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>>  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>>  at org.apache.spark.scheduler.Task.run(Task.scala:89)
>>  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>>  at 
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>  at 
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>  at java.lang.Thread.run(Thread.java:745)
>>
>> Driver stacktrace:
>>  at org.apache.spark.scheduler.DAGScheduler.org 
>> $apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431)
>>  at 
>> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419)
>>  at 
>> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418)
>>  at 
>> 

GraphX performance and settings

2016-06-15 Thread Maja Kabiljo
Hi,

We are running some experiments with GraphX in order to compare it with other 
systems. There are multiple settings which significantly affect performance, 
and we experimented a lot in order to tune them well. I'll share the best 
settings we found so far and the results we got with them, and would really 
appreciate it if anyone who has used GraphX before has any advice on what else 
can make it even better, or can confirm that these results are as good as it gets.

Algorithms we used are pagerank and connected components. We used Twitter and 
UK graphs from the GraphX paper 
(https://amplab.cs.berkeley.edu/wp-content/uploads/2014/09/graphx.pdf), and 
also generated graphs with properties similar to Facebook social graph with 
various number of edges. Apart from performance we tried to see what is the 
minimum amount of resources it requires in order to handle graph of some size.

We ran experiments using Spark 1.6.1, on machines which have 20 cores with 
2-way SMT, always fixing number of executors (min=max=initial), giving 40GB or 
80GB per executor, and making sure we run only a single executor per machine. 
Additionally we used:

  *   spark.shuffle.manager=hash, spark.shuffle.service.enabled=false
  *   Parallel GC
  *   PartitionStrategy.EdgePartition2D
  *   8*numberOfExecutors partitions

Here are some data points which we got:

  *   Running on Facebook-like graph with 2 billion edges, using 4 executors 
with 80GB each it took 451 seconds to do 20 iterations of pagerank and 236 
seconds to find connected components. It failed when we tried to use 2 
executors, or 4 executors with 40GB each.
  *   For graph with 10 billion edges we needed 16 executors with 80GB each (it 
failed with 8), 1041 seconds for 20 iterations of pagerank and 716 seconds for 
connected components.
  *   Twitter-2010 graph (1.5 billion edges), 8 executors, 40GB each, pagerank 
473s, connected components 264s. With 4 executors 80GB each it worked but was 
struggling (pr 2475s, cc 4499s), with 8 executors 80GB pr 362s, cc 255s.

One more thing, we were not able to reproduce what's mentioned in the paper 
about fault tolerance (section 5.2). If we kill an executor during the first 
few iterations, it recovers successfully, but if it is killed in later 
iterations, reconstruction of each iteration starts taking exponentially longer 
and doesn't finish even after letting it run for a few hours. Are there some 
additional parameters we need to set in order for this to work?

Any feedback would be highly appreciated!

Thank you,
Maja


Re: java server error - spark

2016-06-15 Thread Jeff Zhang
>>> Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded


It is OOM on the executor.  Please try to increase executor memory.
"--executor-memory"

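A minimal sketch of what that looks like, assuming PySpark 1.6 (the sizes
and app name below are placeholders, not values from this thread). Note that
the driver's own heap has to be set before the driver JVM starts, so it
belongs on the spark-submit command line rather than in SparkConf:

# Sketch only: raise executor memory above the 1g default.
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("mysql-query")               # hypothetical app name
        .set("spark.executor.memory", "8g"))     # heap per executor
sc = SparkContext(conf=conf)

# Equivalent command-line form (this is also how driver memory is raised):
#   spark-submit --executor-memory 8g --driver-memory 8g my_app.py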




On Thu, Jun 16, 2016 at 8:54 AM, spR  wrote:

> Hey,
>
> error trace -
>
> hey,
>
>
> error trace -
>
>
> ---Py4JJavaError
>  Traceback (most recent call 
> last) in <module>()----> 1 temp.take(2)
>
> /Users/my/Documents/My_Study_folder/spark-1.6.1/python/pyspark/sql/dataframe.pyc
>  in take(self, num)304 with SCCallSiteSync(self._sc) as css:
> 305 port = 
> self._sc._jvm.org.apache.spark.sql.execution.EvaluatePython.takeAndServe(--> 
> 306 self._jdf, num)307 return 
> list(_load_from_socket(port, BatchedSerializer(PickleSerializer())))
> 308
>
> /Users/my/Documents/My_Study_folder/spark-1.6.1/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py
>  in __call__(self, *args)811 answer = 
> self.gateway_client.send_command(command)812 return_value = 
> get_return_value(--> 813 answer, self.gateway_client, 
> self.target_id, self.name)814
> 815 for temp_arg in temp_args:
>
> /Users/my/Documents/My_Study_folder/spark-1.6.1/python/pyspark/sql/utils.pyc 
> in deco(*a, **kw) 43 def deco(*a, **kw): 44 try:---> 45   
>   return f(*a, **kw) 46 except 
> py4j.protocol.Py4JJavaError as e: 47 s = 
> e.java_exception.toString()
> /Users/my/Documents/My_Study_folder/spark-1.6.1/python/lib/py4j-0.9-src.zip/py4j/protocol.py
>  in get_return_value(answer, gateway_client, target_id, name)306  
>raise Py4JJavaError(307 "An error occurred 
> while calling {0}{1}{2}.\n".--> 308 format(target_id, 
> ".", name), value)309 else:
> 310 raise Py4JError(
> Py4JJavaError: An error occurred while calling 
> z:org.apache.spark.sql.execution.EvaluatePython.takeAndServe.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 
> in stage 3.0 failed 1 times, most recent failure: Lost task 0.0 in stage 3.0 
> (TID 76, localhost): java.lang.OutOfMemoryError: GC overhead limit exceeded
>   at com.mysql.jdbc.MysqlIO.nextRowFast(MysqlIO.java:2205)
>   at com.mysql.jdbc.MysqlIO.nextRow(MysqlIO.java:1984)
>   at com.mysql.jdbc.MysqlIO.readSingleRowSet(MysqlIO.java:3403)
>   at com.mysql.jdbc.MysqlIO.getResultSet(MysqlIO.java:470)
>   at com.mysql.jdbc.MysqlIO.readResultsForQueryOrUpdate(MysqlIO.java:3105)
>   at com.mysql.jdbc.MysqlIO.readAllResults(MysqlIO.java:2336)
>   at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2729)
>   at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2549)
>   at 
> com.mysql.jdbc.PreparedStatement.executeInternal(PreparedStatement.java:1861)
>   at 
> com.mysql.jdbc.PreparedStatement.executeQuery(PreparedStatement.java:1962)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anon$1.<init>(JDBCRDD.scala:363)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD.compute(JDBCRDD.scala:339)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>   at org.apache.spark.scheduler.Task.run(Task.scala:89)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
>
> Driver stacktrace:
>   at org.apache.spark.scheduler.DAGScheduler.org 
> $apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at 
> 

Re: java server error - spark

2016-06-15 Thread spR
Hey,

error trace -

hey,


error trace -


---Py4JJavaError
Traceback (most recent call
last) in <module>()----> 1 temp.take(2)

/Users/my/Documents/My_Study_folder/spark-1.6.1/python/pyspark/sql/dataframe.pyc
in take(self, num)304 with SCCallSiteSync(self._sc) as
css:305 port =
self._sc._jvm.org.apache.spark.sql.execution.EvaluatePython.takeAndServe(-->
306 self._jdf, num)307 return
list(_load_from_socket(port, BatchedSerializer(PickleSerializer())))
308

/Users/my/Documents/My_Study_folder/spark-1.6.1/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py
in __call__(self, *args)811 answer =
self.gateway_client.send_command(command)812 return_value
= get_return_value(--> 813 answer, self.gateway_client,
self.target_id, self.name)814
815 for temp_arg in temp_args:

/Users/my/Documents/My_Study_folder/spark-1.6.1/python/pyspark/sql/utils.pyc
in deco(*a, **kw) 43 def deco(*a, **kw): 44
try:---> 45 return f(*a, **kw) 46 except
py4j.protocol.Py4JJavaError as e: 47 s =
e.java_exception.toString()
/Users/my/Documents/My_Study_folder/spark-1.6.1/python/lib/py4j-0.9-src.zip/py4j/protocol.py
in get_return_value(answer, gateway_client, target_id, name)306
 raise Py4JJavaError(307 "An error
occurred while calling {0}{1}{2}.\n".--> 308
format(target_id, ".", name), value)309 else:
310 raise Py4JError(
Py4JJavaError: An error occurred while calling
z:org.apache.spark.sql.execution.EvaluatePython.takeAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure:
Task 0 in stage 3.0 failed 1 times, most recent failure: Lost task 0.0
in stage 3.0 (TID 76, localhost): java.lang.OutOfMemoryError: GC
overhead limit exceeded
at com.mysql.jdbc.MysqlIO.nextRowFast(MysqlIO.java:2205)
at com.mysql.jdbc.MysqlIO.nextRow(MysqlIO.java:1984)
at com.mysql.jdbc.MysqlIO.readSingleRowSet(MysqlIO.java:3403)
at com.mysql.jdbc.MysqlIO.getResultSet(MysqlIO.java:470)
at com.mysql.jdbc.MysqlIO.readResultsForQueryOrUpdate(MysqlIO.java:3105)
at com.mysql.jdbc.MysqlIO.readAllResults(MysqlIO.java:2336)
at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2729)
at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2549)
at 
com.mysql.jdbc.PreparedStatement.executeInternal(PreparedStatement.java:1861)
at 
com.mysql.jdbc.PreparedStatement.executeQuery(PreparedStatement.java:1962)
at 
org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anon$1.<init>(JDBCRDD.scala:363)
at 
org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD.compute(JDBCRDD.scala:339)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org
$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at 
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1418)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
at scala.Option.foreach(Option.scala:236)
at 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:799)
at 
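
For reference, the OutOfMemoryError in the trace above occurs inside the
MySQL driver while a single JDBCRDD task materializes the whole result set.
Besides raising executor memory, a common mitigation is to partition the
JDBC read so that no one task fetches the entire table. A hedged sketch
(PySpark 1.6; the URL, table, column, and bounds are hypothetical, not taken
from this thread):

# Sketch only: split the JDBC read into 16 range-partitioned queries.
df = sqlContext.read.jdbc(
    url="jdbc:mysql://dbhost:3306/mydb",      # hypothetical connection string
    table="big_table",                        # hypothetical table name
    column="id",                              # assumed numeric key column
    lowerBound=0,
    upperBound=10000000,                      # rough min/max of the key column
    numPartitions=16,
    properties={"user": "u", "password": "p"})
df.take(2)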

Re: Hive 1.0.0 not able to read Spark 1.6.1 parquet output files on EMR 4.7.0

2016-06-15 Thread Cheng Lian

Spark 1.6.1 is also using Parquet 1.7.0.

Could you please share the schema of your Parquet file as well as the 
exact exception stack trace reported by Hive?



Cheng
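
A quick way to capture the schema half of that request (a sketch; the path
is hypothetical):

# Sketch only: print the Parquet schema as Spark sees it, to share on the list.
df = sqlContext.read.parquet("s3://bucket/spark-output/")  # hypothetical path
df.printSchema()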


On 6/13/16 12:56 AM, mayankshete wrote:

Hello Team,

I am facing an issue where output files generated by Spark 1.6.1 cannot be
read by Hive 1.0.0. This is because Hive 1.0.0 uses an older Parquet version
than Spark 1.6.1, which uses Parquet 1.7.0.

Is it possible to use an older Parquet version in Spark, or a newer Parquet
version in Hive?
I have tried adding parquet-hive-bundle 1.7.0 to Hive, but while reading it
throws: Failed with exception
java.io.IOException: java.lang.NullPointerException

Can anyone suggest a solution?

Thanks ,
Mayank



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Hive-1-0-0-not-able-to-read-Spark-1-6-1-parquet-output-files-on-EMR-4-7-0-tp27144.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: java server error - spark

2016-06-15 Thread Jeff Zhang
Could you paste the full stack trace?

On Thu, Jun 16, 2016 at 7:24 AM, spR  wrote:

> Hi,
> I am getting this error while executing a query using sqlcontext.sql
>
> The table has around 2.5 gb of data to be scanned.
>
> First I get out of memory exception. But I have 16 gb of ram
>
> Then my notebook dies and I get below error
>
> Py4JNetworkError: An error occurred while trying to connect to the Java server
>
>
> Thank You
>



-- 
Best Regards

Jeff Zhang


Re: ERROR RetryingBlockFetcher: Exception while beginning fetch of 1 outstanding blocks

2016-06-15 Thread VG
Any suggestions on this please

On Wed, Jun 15, 2016 at 10:42 PM, VG  wrote:

> I have a very simple driver which loads a textFile and filters a
>> sub-string from each line in the textfile.
>> When the collect action is executed, I am getting an exception. (The
>> file is only 90 MB, so I am confused about what is going on.) I am running on a
>> local standalone cluster.
>>
>> 16/06/15 19:45:22 INFO BlockManagerInfo: Removed broadcast_2_piece0 on
>> 192.168.56.1:56413 in memory (size: 2.5 KB, free: 2.4 GB)
>> 16/06/15 19:45:22 INFO BlockManagerInfo: Removed broadcast_1_piece0 on
>> 192.168.56.1:56413 in memory (size: 1900.0 B, free: 2.4 GB)
>> 16/06/15 19:45:22 INFO BlockManagerInfo: Added rdd_2_1 on disk on
>> 192.168.56.1:56413 (size: 2.7 MB)
>> 16/06/15 19:45:22 INFO MemoryStore: Block taskresult_7 stored as bytes in
>> memory (estimated size 2.7 MB, free 2.4 GB)
>> 16/06/15 19:45:22 INFO BlockManagerInfo: Added taskresult_7 in memory on
>> 192.168.56.1:56413 (size: 2.7 MB, free: 2.4 GB)
>> 16/06/15 19:45:22 INFO Executor: Finished task 1.0 in stage 2.0 (TID 7).
>> 2823777 bytes result sent via BlockManager)
>> 16/06/15 19:45:22 INFO TaskSetManager: Starting task 2.0 in stage 2.0
>> (TID 8, localhost, partition 2, PROCESS_LOCAL, 5422 bytes)
>> 16/06/15 19:45:22 INFO Executor: Running task 2.0 in stage 2.0 (TID 8)
>> 16/06/15 19:45:22 INFO HadoopRDD: Input split:
>> file:/C:/Users/i303551/Downloads/ariba-logs/ssws/access.2016.04.26/access.2016.04.26:67108864+25111592
>> 16/06/15 19:45:22 INFO BlockManagerInfo: Added rdd_2_2 on disk on
>> 192.168.56.1:56413 (size: 2.0 MB)
>> 16/06/15 19:45:22 INFO MemoryStore: Block taskresult_8 stored as bytes in
>> memory (estimated size 2.0 MB, free 2.4 GB)
>> 16/06/15 19:45:22 INFO BlockManagerInfo: Added taskresult_8 in memory on
>> 192.168.56.1:56413 (size: 2.0 MB, free: 2.4 GB)
>> 16/06/15 19:45:22 INFO Executor: Finished task 2.0 in stage 2.0 (TID 8).
>> 2143771 bytes result sent via BlockManager)
>> 16/06/15 19:45:43 ERROR RetryingBlockFetcher: Exception while beginning
>> fetch of 1 outstanding blocks
>> java.io.IOException: Failed to connect to /192.168.56.1:56413
>> at
>> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:228)
>> at
>> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:179)
>> at
>> org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:96)
>> at
>> org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
>> at
>> org.apache.spark.network.shuffle.RetryingBlockFetcher.start(RetryingBlockFetcher.java:120)
>> at
>> org.apache.spark.network.netty.NettyBlockTransferService.fetchBlocks(NettyBlockTransferService.scala:105)
>> at
>> org.apache.spark.network.BlockTransferService.fetchBlockSync(BlockTransferService.scala:92)
>> at
>> org.apache.spark.storage.BlockManager.getRemoteBytes(BlockManager.scala:546)
>> at
>> org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:76)
>> at
>> org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:57)
>> at
>> org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:57)
>> at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1793)
>> at
>> org.apache.spark.scheduler.TaskResultGetter$$anon$2.run(TaskResultGetter.scala:56)
>> at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
>> at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
>> at java.lang.Thread.run(Unknown Source)
>> Caused by: java.net.ConnectException: Connection timed out: no further
>> information: /192.168.56.1:56413
>> at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>> at sun.nio.ch.SocketChannelImpl.finishConnect(Unknown Source)
>> at
>> io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:224)
>> at
>> io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:289)
>> at
>> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528)
>> at
>> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
>> at
>> io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
>> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
>> at
>> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
>> ... 1 more
>> 16/06/15 19:45:43 INFO RetryingBlockFetcher: Retrying fetch (1/3) for 1
>> outstanding blocks after 5000 ms
>> 16/06/15 19:46:04 ERROR RetryingBlockFetcher: Exception while beginning
>> fetch of 1 outstanding blocks
>> java.io.IOException: Failed to connect to /192.168.56.1:56413
>> at
>> 
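
One observation on the trace above: the connect timeouts are to
192.168.56.1, which is typically a VirtualBox host-only adapter address, so
a common first check is whether Spark auto-selected the wrong network
interface. A heavily hedged sketch of pinning it explicitly (the address is
hypothetical; PySpark shown for brevity, the Scala/Java conf call is
analogous):

# Sketch only: pin the driver to a reachable interface instead of letting
# Spark pick a virtual adapter. The IP below is a placeholder.
from pyspark import SparkConf, SparkContext

conf = SparkConf().set("spark.driver.host", "10.0.0.5")  # hypothetical IP
sc = SparkContext(conf=conf)

# The same can be done via the environment before launch:
#   SPARK_LOCAL_IP=10.0.0.5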

Error Running SparkPi.scala Example

2016-06-15 Thread Krishna Kalyan
Hello,
I am faced with problems when I try to run SparkPi.scala.
I took the following steps below:
a) git pull https://github.com/apache/spark
b) Import the project in Intellij as a maven project
c) Run 'SparkPi'

Error Below:
Information:16/06/16 01:34 - Compilation completed with 10 errors and 5
warnings in 5s 843ms
Warning:scalac: Class org.jboss.netty.channel.ChannelFactory not found -
continuing with a stub.
Warning:scalac: Class org.jboss.netty.channel.ChannelPipelineFactory not
found - continuing with a stub.
Warning:scalac: Class org.jboss.netty.handler.execution.ExecutionHandler
not found - continuing with a stub.
Warning:scalac: Class org.jboss.netty.channel.group.ChannelGroup not found
- continuing with a stub.
Warning:scalac: Class com.google.common.collect.ImmutableMap not found -
continuing with a stub.
/Users/krishna/Experiment/spark/external/flume-sink/src/main/scala/org/apache/spark/streaming/flume/sink/SparkAvroCallbackHandler.scala
Error:(45, 66) not found: type SparkFlumeProtocol
  val transactionTimeout: Int, val backOffInterval: Int) extends
SparkFlumeProtocol with Logging {
 ^
Error:(70, 39) not found: type EventBatch
  override def getEventBatch(n: Int): EventBatch = {
  ^
Error:(85, 13) not found: type EventBatch
new EventBatch("Spark sink has been stopped!", "",
java.util.Collections.emptyList())
^
/Users/krishna/Experiment/spark/external/flume-sink/src/main/scala/org/apache/spark/streaming/flume/sink/TransactionProcessor.scala
Error:(80, 22) not found: type EventBatch
  def getEventBatch: EventBatch = {
 ^
Error:(48, 37) not found: type EventBatch
  @volatile private var eventBatch: EventBatch = new EventBatch("Unknown
Error", "",
^
Error:(48, 54) not found: type EventBatch
  @volatile private var eventBatch: EventBatch = new EventBatch("Unknown
Error", "",
 ^
Error:(115, 41) not found: type SparkSinkEvent
val events = new util.ArrayList[SparkSinkEvent](maxBatchSize)
^
Error:(146, 28) not found: type EventBatch
  eventBatch = new EventBatch("", seqNum, events)
   ^
/Users/krishna/Experiment/spark/external/flume-sink/src/main/scala/org/apache/spark/streaming/flume/sink/SparkSinkUtils.scala
Error:(25, 27) not found: type EventBatch
  def isErrorBatch(batch: EventBatch): Boolean = {
  ^
/Users/krishna/Experiment/spark/external/flume-sink/src/main/scala/org/apache/spark/streaming/flume/sink/SparkSink.scala
Error:(86, 51) not found: type SparkFlumeProtocol
val responder = new SpecificResponder(classOf[SparkFlumeProtocol],
handler.get)

Thanks,
Krishan


Re: Limit pyspark.daemon threads

2016-06-15 Thread Jeff Zhang
>>> I am seeing this issue too with pyspark (Using Spark 1.6.1).  I have
set spark.executor.cores to 1, but I see that whenever streaming batch
starts processing data, see python -m pyspark.daemon processes increase
gradually to about 5, (increasing CPU% on a box about 4-5 times, each
pyspark.daemon takes up around 100 % CPU)
>>> After the processing is done 4 pyspark.daemon processes go away and we
are left with one till the next batch run. Also sometimes the  CPU usage
for executor process spikes to about 800% even though spark.executor.core
is set to 1


As I understand it, each Spark task consumes at most one Python process. In
this case (spark.executor.cores=1), there should be at most one Python
process per executor. Since there are 4 Python processes, I suspect there
are at least 4 executors on this machine. Could you check that?

On Thu, Jun 16, 2016 at 6:50 AM, Sudhir Babu Pothineni <
sbpothin...@gmail.com> wrote:

> Hi Ken, It may be also related to Grid Engine job scheduling? If it is 16
> core (virtual cores?), grid engine allocates 16 slots, If you use 'max'
> scheduling, it will send 16 processes sequentially to same machine, on the
> top of it each spark job has its own executors. Limit the number of jobs
> scheduled to the machine = number of physical cores of single CPU, it will
> solve the problem if it is related to GE. If you are sure it's related to
> Spark, please ignore.
>
> -Sudhir
>
>
> Sent from my iPhone
>
> On Jun 15, 2016, at 8:53 AM, Gene Pang  wrote:
>
> As Sven mentioned, you can use Alluxio to store RDDs in off-heap memory,
> and you can then share that RDD across different jobs. If you would like to
> run Spark on Alluxio, this documentation can help:
> http://www.alluxio.org/documentation/master/en/Running-Spark-on-Alluxio.html
>
> Thanks,
> Gene
>
> On Tue, Jun 14, 2016 at 12:44 AM, agateaaa  wrote:
>
>> Hi,
>>
>> I am seeing this issue too with pyspark (Using Spark 1.6.1).  I have set
>> spark.executor.cores to 1, but I see that whenever streaming batch starts
>> processing data, see python -m pyspark.daemon processes increase gradually
>> to about 5, (increasing CPU% on a box about 4-5 times, each pyspark.daemon
>> takes up around 100 % CPU)
>>
>> After the processing is done 4 pyspark.daemon processes go away and we
>> are left with one till the next batch run. Also sometimes the  CPU usage
>> for executor process spikes to about 800% even though spark.executor.core
>> is set to 1
>>
>> e.g. top output
>> PID USER  PR   NI  VIRT  RES  SHR S   %CPU %MEMTIME+  COMMAND
>> 19634 spark 20   0 8871420 1.790g  32056 S 814.1  2.9   0:39.33
>> /usr/lib/j+ <--EXECUTOR
>>
>> 13897 spark 20   0   46576  17916   6720 S   100.0  0.0   0:00.17
>> python -m + <--pyspark.daemon
>> 13991 spark 20   0   46524  15572   4124 S   98.0  0.0   0:08.18
>> python -m + <--pyspark.daemon
>> 14488 spark 20   0   46524  15636   4188 S   98.0  0.0   0:07.25
>> python -m + <--pyspark.daemon
>> 14514 spark 20   0   46524  15636   4188 S   94.0  0.0   0:06.72
>> python -m + <--pyspark.daemon
>> 14526 spark 20   0   48200  17172   4092 S   0.0  0.0   0:00.38
>> python -m + <--pyspark.daemon
>>
>>
>>
>> Is there any way to control the number of pyspark.daemon processes that
>> get spawned ?
>>
>> Thank you
>> Agateaaa
>>
>> On Sun, Mar 27, 2016 at 1:08 AM, Sven Krasser  wrote:
>>
>>> Hey Ken,
>>>
>>> 1. You're correct, cached RDDs live on the JVM heap. (There's an
>>> off-heap storage option using Alluxio, formerly Tachyon, with which I have
>>> no experience however.)
>>>
>>> 2. The worker memory setting is not a hard maximum unfortunately. What
>>> happens is that during aggregation the Python daemon will check its process
>>> size. If the size is larger than this setting, it will start spilling to
>>> disk. I've seen many occasions where my daemons grew larger. Also, you're
>>> relying on Python's memory management to free up space again once objects
>>> are evicted. In practice, leave this setting reasonably small but make sure
>>> there's enough free memory on the machine so you don't run into OOM
>>> conditions. If the lower memory setting causes strains for your users, make
>>> sure they increase the parallelism of their jobs (smaller partitions
>>> meaning less data is processed at a time).
>>>
>>> 3. I believe that is the behavior you can expect when setting
>>> spark.executor.cores. I've not experimented much with it and haven't looked
>>> at that part of the code, but what you describe also reflects my
>>> understanding. Please share your findings here, I'm sure those will be very
>>> helpful to others, too.
>>>
>>> One more suggestion for your users is to move to the Pyspark DataFrame
>>> API. Much of the processing will then happen in the JVM, and you will bump
>>> into fewer Python resource contention issues.
>>>
>>> Best,
>>> -Sven
>>>
>>>
>>> On Sat, Mar 26, 2016 at 1:38 PM, Carlile, Ken 

java server error - spark

2016-06-15 Thread spR
Hi,
I am getting this error while executing a query using sqlcontext.sql

The table has around 2.5 gb of data to be scanned.

First I get an out-of-memory exception, even though I have 16 GB of RAM.

Then my notebook dies and I get the error below:

Py4JNetworkError: An error occurred while trying to connect to the Java server


Thank You


[ANNOUNCE] Apache SystemML 0.10.0-incubating released

2016-06-15 Thread Luciano Resende
The Apache SystemML team is pleased to announce the release of Apache
SystemML version 0.10.0-incubating.

Apache SystemML provides declarative large-scale machine learning (ML) that
aims at flexible specification of ML algorithms and automatic generation of
hybrid runtime plans ranging from single-node, in-memory computations, to
distributed computations on Apache Hadoop MapReduce and Apache Spark.

Extensive updates have been made to the release in several areas. These
include APIs, data ingestion, optimizations, language and runtime
operators, new algorithms, testing, and online documentation. For detailed
information about the updates, please access the release notes available at:

http://systemml.apache.org/0.10.0-incubating/release_notes.html

To download the distribution, please go to:

http://systemml.apache.org/

The Apache SystemML Team

---
Apache SystemML is an effort undergoing Incubation at The Apache Software Foundation
(ASF), sponsored by the Incubator. Incubation is required of all newly
accepted projects until a further review indicates that the infrastructure,
communications, and decision making process have stabilized in a manner
consistent with other successful ASF projects. While incubation status is
not necessarily a reflection of the completeness or stability of the code,
it does indicate that the project has yet to be fully endorsed by the ASF.


Re: concat spark dataframes

2016-06-15 Thread spR
Hey,

There are quite a lot of fields, but there are no common fields between the
two dataframes. Can I not concatenate the two frames the way we can in
pandas, so that the resulting dataframe has the columns from both?

Thank you.

Regards,
Misha



On Wed, Jun 15, 2016 at 3:44 PM, Mohammed Guller 
wrote:

> Hi Misha,
>
> What is the schema for both the DataFrames? And what is the expected
> schema of the resulting DataFrame?
>
>
>
> Mohammed
>
> Author: Big Data Analytics with Spark
> 
>
>
>
> *From:* Natu Lauchande [mailto:nlaucha...@gmail.com]
> *Sent:* Wednesday, June 15, 2016 2:07 PM
> *To:* spR
> *Cc:* user
> *Subject:* Re: concat spark dataframes
>
>
>
> Hi,
>
> You can select the common columns and use DataFrame.unionAll.
>
> Regards,
>
> Natu
>
>
>
> On Wed, Jun 15, 2016 at 8:57 PM, spR  wrote:
>
> hi,
>
>
>
> how to concatenate spark dataframes? I have 2 frames with certain columns.
> I want to get a dataframe with columns from both the other frames.
>
>
>
> Regards,
>
> Misha
>
>
>
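
There is no direct equivalent of a pandas axis=1 concat on Spark DataFrames,
but one common workaround is to key both frames by row position and join. A
sketch (PySpark 1.6; df1 and df2 stand for the two frames, which are assumed
to have the same number of rows and no overlapping column names; it also
assumes that relying on row order is acceptable, which it often is not after
a shuffle):

# Sketch only: pandas-style side-by-side concat via positional keys.
from pyspark.sql import Row

def with_index(df):
    # (row, index) -> (index, row), so the two RDDs can be joined by position
    return df.rdd.zipWithIndex().map(lambda kv: (kv[1], kv[0]))

def merge_rows(pair):
    left, right = pair
    merged = left.asDict()
    merged.update(right.asDict())
    return Row(**merged)

merged_rdd = with_index(df1).join(with_index(df2)).values().map(merge_rows)
result = sqlContext.createDataFrame(merged_rdd)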


RE: Spark 2.0 release date

2016-06-15 Thread Mohammed Guller
Andy – instead of Naïve Bayes, you should have used the Multi-layer Perceptron 
classifier ☺

Mohammed
Author: Big Data Analytics with 
Spark

From: andy petrella [mailto:andy.petre...@gmail.com]
Sent: Wednesday, June 15, 2016 7:57 AM
To: Ted Yu
Cc: Mich Talebzadeh; Chaturvedi Chola; user @spark
Subject: Re: Spark 2.0 release date

Yeah well... the prior was high... but don't have enough data on Mich to have 
an accurate likelihood :-)
But ok, my bad, I continue with the preview stuff and leave this thread in 
peace ^^
tx ted
cheers

On Wed, Jun 15, 2016 at 4:47 PM Ted Yu 
> wrote:
Andy:
You should sense the tone in Mich's response.

To my knowledge, there hasn't been an RC for the 2.0 release yet.
Once we have an RC, it goes through the normal voting process.

FYI

On Wed, Jun 15, 2016 at 7:38 AM, andy petrella 
> wrote:
> tomorrow lunch time
Which TZ :-) → I'm working on the update of some materials that Dean Wampler 
and I will give tomorrow at Scala Days (well, tomorrow CEST).

Hence, I'm upgrading the materials to spark 2.0.0-preview; do you think 2.0.0 
will be released before 6PM CEST (9AM PDT)? I don't want to be a joke in front 
of the audience with my almost cutting-edge version :-P

tx


On Wed, Jun 15, 2016 at 3:59 PM Mich Talebzadeh 
> wrote:
Tomorrow lunchtime.

Btw, can you stop spamming every big data forum about your big data interview
questions book?

I have seen your mails about this big data book in the Spark, Hive and Tez
forums, and I am sure there are many others. That seems to be the only mail
you send around.

This forum is for technical discussions, not promotional material. Please
confine yourself to technical matters.






Dr Mich Talebzadeh



LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 15 June 2016 at 12:45, Chaturvedi Chola 
> wrote:
when is the spark 2.0 release planned

--
andy

--
andy


Re: Limit pyspark.daemon threads

2016-06-15 Thread Sudhir Babu Pothineni
Hi Ken, could it also be related to Grid Engine job scheduling? If it is 16 cores 
(virtual cores?), Grid Engine allocates 16 slots; if you use 'max' scheduling, 
it will send 16 processes sequentially to the same machine, and on top of that 
each Spark job has its own executors. Limiting the number of jobs scheduled to 
the machine to the number of physical cores of a single CPU will solve the 
problem, if it is indeed related to GE. If you are sure it's related to Spark, 
please ignore.

-Sudhir


Sent from my iPhone

> On Jun 15, 2016, at 8:53 AM, Gene Pang  wrote:
> 
> As Sven mentioned, you can use Alluxio to store RDDs in off-heap memory, and 
> you can then share that RDD across different jobs. If you would like to run 
> Spark on Alluxio, this documentation can help: 
> http://www.alluxio.org/documentation/master/en/Running-Spark-on-Alluxio.html
> 
> Thanks,
> Gene
> 
>> On Tue, Jun 14, 2016 at 12:44 AM, agateaaa  wrote:
>> Hi,
>> 
>> I am seeing this issue too with pyspark (Using Spark 1.6.1).  I have set 
>> spark.executor.cores to 1, but I see that whenever streaming batch starts 
>> processing data, see python -m pyspark.daemon processes increase gradually 
>> to about 5, (increasing CPU% on a box about 4-5 times, each pyspark.daemon 
>> takes up around 100 % CPU) 
>> 
>> After the processing is done 4 pyspark.daemon processes go away and we are 
>> left with one till the next batch run. Also sometimes the  CPU usage for 
>> executor process spikes to about 800% even though spark.executor.core is set 
>> to 1
>> 
>> e.g. top output
>> PID USER  PR   NI  VIRT  RES  SHR S   %CPU %MEMTIME+  COMMAND
>> 19634 spark 20   0 8871420 1.790g  32056 S 814.1  2.9   0:39.33 
>> /usr/lib/j+ <--EXECUTOR
>> 
>> 13897 spark 20   0   46576  17916   6720 S   100.0  0.0   0:00.17 python 
>> -m + <--pyspark.daemon
>> 13991 spark 20   0   46524  15572   4124 S   98.0  0.0   0:08.18 python 
>> -m + <--pyspark.daemon
>> 14488 spark 20   0   46524  15636   4188 S   98.0  0.0   0:07.25 python 
>> -m + <--pyspark.daemon
>> 14514 spark 20   0   46524  15636   4188 S   94.0  0.0   0:06.72 python 
>> -m + <--pyspark.daemon
>> 14526 spark 20   0   48200  17172   4092 S   0.0  0.0   0:00.38 python 
>> -m + <--pyspark.daemon
>> 
>> 
>> 
>> Is there any way to control the number of pyspark.daemon processes that get 
>> spawned ?
>> 
>> Thank you
>> Agateaaa
>> 
>>> On Sun, Mar 27, 2016 at 1:08 AM, Sven Krasser  wrote:
>>> Hey Ken,
>>> 
>>> 1. You're correct, cached RDDs live on the JVM heap. (There's an off-heap 
>>> storage option using Alluxio, formerly Tachyon, with which I have no 
>>> experience however.)
>>> 
>>> 2. The worker memory setting is not a hard maximum unfortunately. What 
>>> happens is that during aggregation the Python daemon will check its process 
>>> size. If the size is larger than this setting, it will start spilling to 
>>> disk. I've seen many occasions where my daemons grew larger. Also, you're 
>>> relying on Python's memory management to free up space again once objects 
>>> are evicted. In practice, leave this setting reasonably small but make sure 
>>> there's enough free memory on the machine so you don't run into OOM 
>>> conditions. If the lower memory setting causes strains for your users, make 
>>> sure they increase the parallelism of their jobs (smaller partitions 
>>> meaning less data is processed at a time).
>>> 
>>> 3. I believe that is the behavior you can expect when setting 
>>> spark.executor.cores. I've not experimented much with it and haven't looked 
>>> at that part of the code, but what you describe also reflects my 
>>> understanding. Please share your findings here, I'm sure those will be very 
>>> helpful to others, too.
>>> 
>>> One more suggestion for your users is to move to the Pyspark DataFrame API. 
>>> Much of the processing will then happen in the JVM, and you will bump into 
>>> fewer Python resource contention issues.
>>> 
>>> Best,
>>> -Sven
>>> 
>>> 
 On Sat, Mar 26, 2016 at 1:38 PM, Carlile, Ken  
 wrote:
 This is extremely helpful!
 
 I’ll have to talk to my users about how the python memory limit should be 
 adjusted and what their expectations are. I’m fairly certain we bumped it 
 up in the dark past when jobs were failing because of insufficient memory 
 for the python processes. 
 
 So just to make sure I’m understanding correctly: 
 
 *   JVM memory (set by SPARK_EXECUTOR_MEMORY and/or SPARK_WORKER_MEMORY?) is 
 where the RDDs are stored. Currently both of those values are set to 90GB
 *   spark.python.worker.memory controls how much RAM each python task can 
 take, maximum (roughly speaking). Currently set to 4GB
 *   spark.task.cpus controls how many java worker threads will exist and thus 
 indirectly how many pyspark daemon processes will exist
 
 I’m also looking into fixing my cron jobs so 

RE: concat spark dataframes

2016-06-15 Thread Mohammed Guller
Hi Misha,
What is the schema for both the DataFrames? And what is the expected schema of 
the resulting DataFrame?

Mohammed
Author: Big Data Analytics with 
Spark

From: Natu Lauchande [mailto:nlaucha...@gmail.com]
Sent: Wednesday, June 15, 2016 2:07 PM
To: spR
Cc: user
Subject: Re: concat spark dataframes

Hi,
You can select the common columns and use DataFrame.unionAll.
Regards,
Natu

On Wed, Jun 15, 2016 at 8:57 PM, spR 
> wrote:
hi,

how to concatenate spark dataframes? I have 2 frames with certain columns. I 
want to get a dataframe with columns from both the other frames.

Regards,
Misha



Re: HIVE Query 25x faster than SPARK Query

2016-06-15 Thread Mahender Sarangam
+1.

We also see performance degradation when comparing Spark SQL with Hive.
We have a table of 260 columns and have executed the same query in both. In
Hive it takes 66 sec for 1 GB of data, whereas in Spark it takes 4 mins.

On 6/9/2016 3:19 PM, Gavin Yue wrote:
Could you print out the sql execution plan? My guess is about broadcast join.



On Jun 9, 2016, at 07:14, Gourav Sengupta 
<gourav.sengu...@gmail.com>
 wrote:

Hi,

Query1 is almost 25x faster in HIVE than in SPARK. What is happening here, and 
is there a way we can optimize the queries in SPARK without the obvious hack in 
Query2?


---
ENVIRONMENT:
---

> Table A has 533 columns x 24 million rows and Table B has 2 columns x 3 million 
> rows. Both tables are single gzipped CSV files.
> Both table A and B are external tables in AWS S3 and created in HIVE accessed 
> through SPARK using HiveContext
> EMR 4.6, Spark 1.6.1 and Hive 1.0.0 (clusters started using 
> allowMaximumResource allocation and node types are c3.4xlarge).

--
QUERY1:
--
select A.PK, B.FK
from A
left outer join B on (A.PK = B.FK)
where B.FK is not null;



This query takes 4 mins in HIVE and 1.1 hours in SPARK


--
QUERY 2:
--

select A.PK, B.FK
from (select PK from A) A
left outer join B on (A.PK = B.FK)
where B.FK is not null;

This query takes 4.5 mins in SPARK



Regards,
Gourav Sengupta
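
Gavin's request above can be answered without sharing any data; a sketch of
printing the plan for Query1 (PySpark shown for brevity; sqlContext here
would be the HiveContext from the thread):

# Sketch only: print the logical and physical plans, e.g. to check whether
# a broadcast join was chosen for the small table B.
sqlContext.sql("""
    SELECT A.PK, B.FK
    FROM A LEFT OUTER JOIN B ON (A.PK = B.FK)
    WHERE B.FK IS NOT NULL
""").explain(True)  # True = extended output, including the logical plans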






Re: Limit pyspark.daemon threads

2016-06-15 Thread agateaaa
Yes, I have set spark.cores.max to 3. I have three worker nodes in my Spark
cluster (standalone mode), and spark.executor.cores is set to 1. On each
worker node, whenever the streaming application runs, I see 4 pyspark.daemon
processes get spawned. Each pyspark.daemon process takes up approximately 1
CPU, so 4 CPUs get utilized on each worker node.
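
For reference, a sketch of how the knobs discussed in this thread fit
together (Spark 1.6, standalone mode; the values are the ones mentioned in
the thread, not recommendations, and the app name is hypothetical):

# Sketch only: the settings discussed in this thread, side by side.
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("streaming-app")                 # hypothetical name
        .set("spark.cores.max", "3")                 # total cores for the app, cluster-wide
        .set("spark.executor.cores", "1")            # cores per executor
        .set("spark.python.worker.memory", "512m"))  # spill threshold per python
                                                     # worker, not a hard cap
sc = SparkContext(conf=conf)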

On Wed, Jun 15, 2016 at 9:51 PM, David Newberger <
david.newber...@wandcorp.com> wrote:

> Have you tried setting spark.cores.max?
>
> “When running on a standalone deploy cluster or a Mesos cluster in
> "coarse-grained" sharing mode, the maximum amount of CPU cores to request
> for the application from across the cluster (not from each machine). If not
> set, the default will be spark.deploy.defaultCores on Spark's standalone
> cluster manager, or infinite (all available cores) on Mesos.”
>
>
>
> *David Newberger*
>
>
>
> *From:* agateaaa [mailto:agate...@gmail.com]
> *Sent:* Wednesday, June 15, 2016 4:39 PM
> *To:* Gene Pang
> *Cc:* Sven Krasser; Carlile, Ken; user
> *Subject:* Re: Limit pyspark.daemon threads
>
>
>
> Thx Gene! But my concern is with CPU usage not memory. I want to see if
> there is anyway to control the number of pyspark.daemon processes that get
> spawned. We have some restriction on number of CPU's we can use on a node,
> and number of pyspark.daemon processes that get created dont seem to honor
> spark.executor.cores property setting
>
> Thanks!
>
>
>
> On Wed, Jun 15, 2016 at 1:53 PM, Gene Pang  wrote:
>
> As Sven mentioned, you can use Alluxio to store RDDs in off-heap memory,
> and you can then share that RDD across different jobs. If you would like to
> run Spark on Alluxio, this documentation can help:
> http://www.alluxio.org/documentation/master/en/Running-Spark-on-Alluxio.html
>
>
>
> Thanks,
>
> Gene
>
>
>
> On Tue, Jun 14, 2016 at 12:44 AM, agateaaa  wrote:
>
> Hi,
>
> I am seeing this issue too with pyspark (Using Spark 1.6.1).  I have set
> spark.executor.cores to 1, but I see that whenever streaming batch starts
> processing data, see python -m pyspark.daemon processes increase gradually
> to about 5, (increasing CPU% on a box about 4-5 times, each pyspark.daemon
> takes up around 100 % CPU)
>
> After the processing is done 4 pyspark.daemon processes go away and we are
> left with one till the next batch run. Also sometimes the  CPU usage for
> executor process spikes to about 800% even though spark.executor.core is
> set to 1
>
> e.g. top output
> PID USER  PR   NI  VIRT  RES  SHR S   %CPU %MEMTIME+  COMMAND
> 19634 spark 20   0 8871420 1.790g  32056 S 814.1  2.9   0:39.33
> /usr/lib/j+ <--EXECUTOR
>
> 13897 spark 20   0   46576  17916   6720 S   100.0  0.0   0:00.17
> python -m + <--pyspark.daemon
> 13991 spark 20   0   46524  15572   4124 S   98.0  0.0   0:08.18
> python -m + <--pyspark.daemon
> 14488 spark 20   0   46524  15636   4188 S   98.0  0.0   0:07.25
> python -m + <--pyspark.daemon
> 14514 spark 20   0   46524  15636   4188 S   94.0  0.0   0:06.72
> python -m + <--pyspark.daemon
> 14526 spark 20   0   48200  17172   4092 S   0.0  0.0   0:00.38 python
> -m + <--pyspark.daemon
>
>
>
> Is there any way to control the number of pyspark.daemon processes that
> get spawned ?
>
> Thank you
>
> Agateaaa
>
>
>
> On Sun, Mar 27, 2016 at 1:08 AM, Sven Krasser  wrote:
>
> Hey Ken,
>
>
>
> 1. You're correct, cached RDDs live on the JVM heap. (There's an off-heap
> storage option using Alluxio, formerly Tachyon, with which I have no
> experience however.)
>
>
>
> 2. The worker memory setting is not a hard maximum unfortunately. What
> happens is that during aggregation the Python daemon will check its process
> size. If the size is larger than this setting, it will start spilling to
> disk. I've seen many occasions where my daemons grew larger. Also, you're
> relying on Python's memory management to free up space again once objects
> are evicted. In practice, leave this setting reasonably small but make sure
> there's enough free memory on the machine so you don't run into OOM
> conditions. If the lower memory setting causes strains for your users, make
> sure they increase the parallelism of their jobs (smaller partitions
> meaning less data is processed at a time).
>
>
>
> 3. I believe that is the behavior you can expect when setting
> spark.executor.cores. I've not experimented much with it and haven't looked
> at that part of the code, but what you describe also reflects my
> understanding. Please share your findings here, I'm sure those will be very
> helpful to others, too.
>
>
>
> One more suggestion for your users is to move to the Pyspark DataFrame
> API. Much of the processing will then happen in the JVM, and you will bump
> into fewer Python resource contention issues.
>
>
>
> Best,
>

Re: processing 50 gb data using just one machine

2016-06-15 Thread Mich Talebzadeh
50 GB of data is not much.

Besides master local[4], what else do you have for the other parameters?

${SPARK_HOME}/bin/spark-submit \
  --driver-memory 4G \
  --num-executors 1 \
  --executor-memory 4G \
  --master local[4] \



Try running it and check the web GUI (default port 4040) for resource usage.

HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 15 June 2016 at 17:03, spR  wrote:

> Hi,
>
> Can I use Spark in local mode with 4 cores to process 50 GB of data
> efficiently?
>
> Thank you
>
> misha
>


Re: What is the interpretation of Cores in Spark doc

2016-06-15 Thread Mich Talebzadeh
I think it is slightly more than that.

These days software is licensed by core, generally speaking; that is, by the
physical processor. A core may have one or more threads, or logical
processors. Virtualization adds some fun to the mix: generally what is
presented are 'virtual processors', and what that equates to depends on the
virtualization layer itself. In some simpler VMs, virtual = logical. In
others, virtual = logical but constrained to come from the same cores -
e.g. if you get 6 virtual processors, it really is 3 full cores with 2
threads each. The rationale is the way OS dispatching works on 'logical'
processors vs. cores and POSIX threaded applications.

HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 13 June 2016 at 18:17, Mark Hamstra  wrote:

> I don't know what documentation you were referring to, but this is clearly
> an erroneous statement: "Threads are virtual cores."  At best it is
> terminology abuse by a hardware manufacturer.  Regardless, Spark can't get
> too concerned about how any particular hardware vendor wants to refer to
> the specific components of their CPU architecture.  For us, a core is a
> logical execution unit, something on which a thread of execution can run.
> That can map in different ways to different physical or virtual hardware.
>
> On Mon, Jun 13, 2016 at 12:02 AM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> Hi,
>>
>> It is not the issue of testing anything. I was referring to documentation
>> that clearly use the term "threads". As I said and showed before, one line
>> is using the term "thread" and the next one "logical cores".
>>
>>
>> HTH
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 12 June 2016 at 23:57, Daniel Darabos <
>> daniel.dara...@lynxanalytics.com> wrote:
>>
>>> Spark is a software product. In software a "core" is something that a
>>> process can run on. So it's a "virtual core". (Do not call these "threads".
>>> A "thread" is not something a process can run on.)
>>>
>>> local[*] uses java.lang.Runtime.availableProcessors().
>>> Since Java is software, this also returns the number of virtual cores. (You
>>> can test this easily.)
>>>
>>>
>>> On Sun, Jun 12, 2016 at 9:23 PM, Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>

 Hi,

 I was writing some docs on Spark P and came across this.

 It is about the terminology or interpretation of that in Spark doc.

 This is my understanding of cores and threads.

  Cores are physical cores. Threads are virtual cores. Cores with 2
 threads is called hyper threading technology so 2 threads per core makes
 the core work on two loads at same time. In other words, every thread takes
 care of one load.

 Core has its own memory. So if you have a dual core with hyper
 threading, the core works with 2 loads each at same time because of the 2
 threads per core, but this 2 threads will share memory in that core.

 Some vendors as I am sure most of you aware charge licensing per core.

 For example on the same host that I have Spark, I have a SAP product
 that checks the licensing and shuts the application down if the license
 does not agree with the cores speced.

 This is what it says

 ./cpuinfo
 License hostid:00e04c69159a 0050b60fd1e7
 Detected 12 logical processor(s), 6 core(s), in 1 chip(s)

 So here I have 12 logical processors  and 6 cores and 1 chip. I call
 logical processors as threads so I have 12 threads?

 Now if I go and start worker process
 ${SPARK_HOME}/sbin/start-slaves.sh, I see this in GUI page

 [image: Inline images 1]

 it says 12 cores but I gather it is threads?


 Spark document
 
 states and I quote


 [image: Inline images 2]



 OK, the line for local[k] adds: "set this to the number of cores on
 your machine".


 But I know that it means threads. Because if I went and set that to 6,
 it would be only 6 threads as opposed to 12 threads.


 the next line, local[*], seems to indicate it correctly, as it refers to
 "logical cores", which in my understanding means threads.


 I trust that I am not nitpicking here!


 Cheers,



 Dr Mich 

Re: choice of RDD function

2016-06-15 Thread Cody Koeninger
Doesn't that result in consuming each RDD twice, in order to infer the
json schema?
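
Context for Cody's point: read.json with no schema makes one pass over the
data just to infer a schema, then a second to parse. Supplying the schema up
front avoids the extra pass. A sketch (PySpark shown for brevity, the Scala
API is analogous; the field names come from the query in this thread, the
types are assumptions):

# Sketch only: explicit schema so read.json skips the inference pass.
from pyspark.sql.types import (StructType, StructField,
                               StringType, DoubleType)

drone_schema = StructType([
    StructField("id", StringType()),
    StructField("temp", DoubleType()),
    StructField("rotor_rpm", DoubleType()),
    StructField("winddirection", DoubleType()),
    StructField("windspeed", DoubleType()),
])

# inside foreachRDD, where rdd is the RDD of JSON strings:
df = sqlContext.read.schema(drone_schema).json(rdd)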

On Wed, Jun 15, 2016 at 11:19 AM, Sivakumaran S  wrote:
> Of course :)
>
> object sparkStreaming {
>   def main(args: Array[String]) {
> StreamingExamples.setStreamingLogLevels() //Set reasonable logging
> levels for streaming if the user has not configured log4j.
> val topics = "test"
> val brokers = "localhost:9092"
> val topicsSet = topics.split(",").toSet
> val sparkConf = new
> SparkConf().setAppName("KafkaDroneCalc").setMaster("local")
> //spark://localhost:7077
> val sc = new SparkContext(sparkConf)
> val ssc = new StreamingContext(sc, Seconds(30))
> val kafkaParams = Map[String, String]("metadata.broker.list" -> brokers)
> val messages = KafkaUtils.createDirectStream[String, String,
> StringDecoder, StringDecoder] (ssc, kafkaParams, topicsSet)
> val lines = messages.map(_._2)
> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
> lines.foreachRDD( rdd => {
>   val df = sqlContext.read.json(rdd)
>   df.registerTempTable(“drone")
>   sqlContext.sql("SELECT id, AVG(temp), AVG(rotor_rpm),
> AVG(winddirection), AVG(windspeed) FROM drone GROUP BY id").show()
> })
> ssc.start()
> ssc.awaitTermination()
>   }
> }
>
> I haven’t checked long running performance though.
>
> Regards,
>
> Siva
>
> On 15-Jun-2016, at 5:02 PM, Jacek Laskowski  wrote:
>
> Hi,
>
> Good to hear so! Mind sharing a few snippets of your solution?
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark http://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>
>
> On Wed, Jun 15, 2016 at 5:03 PM, Sivakumaran S  wrote:
>
> Thanks Jacek,
>
> Job completed!! :) Just used data frames and sql query. Very clean and
> functional code.
>
> Siva
>
> On 15-Jun-2016, at 3:10 PM, Jacek Laskowski  wrote:
>
> mapWithState
>
>
>




Re: choice of RDD function

2016-06-15 Thread Sivakumaran S
Cody, 

Are you referring to the val lines = messages.map(_._2) line?

Regards,

Siva

> On 15-Jun-2016, at 10:32 PM, Cody Koeninger  wrote:
> 
> Doesn't that result in consuming each RDD twice, in order to infer the
> json schema?
> 
> On Wed, Jun 15, 2016 at 11:19 AM, Sivakumaran S  wrote:
>> Of course :)
>> 
>> object sparkStreaming {
>>  def main(args: Array[String]) {
>>StreamingExamples.setStreamingLogLevels() //Set reasonable logging
>> levels for streaming if the user has not configured log4j.
>>val topics = "test"
>>val brokers = "localhost:9092"
>>val topicsSet = topics.split(",").toSet
>>val sparkConf = new
>> SparkConf().setAppName("KafkaDroneCalc").setMaster("local")
>> //spark://localhost:7077
>>val sc = new SparkContext(sparkConf)
>>val ssc = new StreamingContext(sc, Seconds(30))
>>val kafkaParams = Map[String, String]("metadata.broker.list" -> brokers)
>>val messages = KafkaUtils.createDirectStream[String, String,
>> StringDecoder, StringDecoder] (ssc, kafkaParams, topicsSet)
>>val lines = messages.map(_._2)
>>val sqlContext = new org.apache.spark.sql.SQLContext(sc)
>>lines.foreachRDD( rdd => {
>>  val df = sqlContext.read.json(rdd)
>>  df.registerTempTable(“drone")
>>  sqlContext.sql("SELECT id, AVG(temp), AVG(rotor_rpm),
>> AVG(winddirection), AVG(windspeed) FROM drone GROUP BY id").show()
>>})
>>ssc.start()
>>ssc.awaitTermination()
>>  }
>> }
>> 
>> I haven’t checked long running performance though.
>> 
>> Regards,
>> 
>> Siva
>> 
>> On 15-Jun-2016, at 5:02 PM, Jacek Laskowski  wrote:
>> 
>> Hi,
>> 
>> Good to hear so! Mind sharing a few snippets of your solution?
>> 
>> Pozdrawiam,
>> Jacek Laskowski
>> 
>> https://medium.com/@jaceklaskowski/
>> Mastering Apache Spark http://bit.ly/mastering-apache-spark
>> Follow me at https://twitter.com/jaceklaskowski
>> 
>> 
>> On Wed, Jun 15, 2016 at 5:03 PM, Sivakumaran S  wrote:
>> 
>> Thanks Jacek,
>> 
>> Job completed!! :) Just used data frames and sql query. Very clean and
>> functional code.
>> 
>> Siva
>> 
>> On 15-Jun-2016, at 3:10 PM, Jacek Laskowski  wrote:
>> 
>> mapWithState
>> 
>> 
>> 





RE: Limit pyspark.daemon threads

2016-06-15 Thread David Newberger
Have you tried setting spark.cores.max?

“When running on a standalone deploy cluster or a Mesos cluster in
"coarse-grained" sharing mode, the maximum amount of CPU cores to request for
the application from across the cluster (not from each machine). If not set,
the default will be spark.deploy.defaultCores on Spark's standalone cluster
manager, or infinite (all available cores) on Mesos.”

David Newberger

From: agateaaa [mailto:agate...@gmail.com]
Sent: Wednesday, June 15, 2016 4:39 PM
To: Gene Pang
Cc: Sven Krasser; Carlile, Ken; user
Subject: Re: Limit pyspark.daemon threads

Thx Gene! But my concern is with CPU usage, not memory. I want to see if there 
is any way to control the number of pyspark.daemon processes that get spawned. 
We have some restrictions on the number of CPUs we can use on a node, and the 
number of pyspark.daemon processes that get created doesn't seem to honor the 
spark.executor.cores property setting.
Thanks!

On Wed, Jun 15, 2016 at 1:53 PM, Gene Pang 
> wrote:
As Sven mentioned, you can use Alluxio to store RDDs in off-heap memory, and 
you can then share that RDD across different jobs. If you would like to run 
Spark on Alluxio, this documentation can help: 
http://www.alluxio.org/documentation/master/en/Running-Spark-on-Alluxio.html

Thanks,
Gene

On Tue, Jun 14, 2016 at 12:44 AM, agateaaa 
> wrote:
Hi,
I am seeing this issue too with pyspark (using Spark 1.6.1). I have set
spark.executor.cores to 1, but I see that whenever a streaming batch starts
processing data, the python -m pyspark.daemon processes increase gradually to
about 5 (increasing CPU% on a box about 4-5 times; each pyspark.daemon takes
up around 100% CPU).
After the processing is done, 4 pyspark.daemon processes go away and we are
left with one till the next batch run. Also, sometimes the CPU usage for the
executor process spikes to about 800% even though spark.executor.cores is set
to 1.
e.g. top output
PID USER  PR   NI  VIRT  RES  SHR S   %CPU %MEMTIME+  COMMAND
19634 spark 20   0 8871420 1.790g  32056 S 814.1  2.9   0:39.33 /usr/lib/j+ 
<--EXECUTOR

13897 spark 20   0   46576  17916   6720 S   100.0  0.0   0:00.17 python -m 
+ <--pyspark.daemon
13991 spark 20   0   46524  15572   4124 S   98.0  0.0   0:08.18 python -m 
+ <--pyspark.daemon
14488 spark 20   0   46524  15636   4188 S   98.0  0.0   0:07.25 python -m 
+ <--pyspark.daemon
14514 spark 20   0   46524  15636   4188 S   94.0  0.0   0:06.72 python -m 
+ <--pyspark.daemon
14526 spark 20   0   48200  17172   4092 S   0.0  0.0   0:00.38 python -m + 
<--pyspark.daemon


Is there any way to control the number of pyspark.daemon processes that get 
spawned ?
Thank you
Agateaaa

On Sun, Mar 27, 2016 at 1:08 AM, Sven Krasser wrote:
Hey Ken,

1. You're correct, cached RDDs live on the JVM heap. (There's an off-heap 
storage option using Alluxio, formerly Tachyon, with which I have no experience 
however.)

2. The worker memory setting is not a hard maximum unfortunately. What happens 
is that during aggregation the Python daemon will check its process size. If 
the size is larger than this setting, it will start spilling to disk. I've seen 
many occasions where my daemons grew larger. Also, you're relying on Python's 
memory management to free up space again once objects are evicted. In practice, 
leave this setting reasonably small but make sure there's enough free memory on 
the machine so you don't run into OOM conditions. If the lower memory setting 
causes strains for your users, make sure they increase the parallelism of their 
jobs (smaller partitions meaning less data is processed at a time).

3. I believe that is the behavior you can expect when setting 
spark.executor.cores. I've not experimented much with it and haven't looked at 
that part of the code, but what you describe also reflects my understanding. 
Please share your findings here, I'm sure those will be very helpful to others, 
too.

One more suggestion for your users is to move to the Pyspark DataFrame API. 
Much of the processing will then happen in the JVM, and you will bump into 
fewer Python resource contention issues.

Best,
-Sven


On Sat, Mar 26, 2016 at 1:38 PM, Carlile, Ken wrote:
This is extremely helpful!

I’ll have to talk to my users about how the python memory limit should be 
adjusted and what their expectations are. I’m fairly certain we bumped it up in 
the dark past when jobs were failing because of insufficient memory for the 
python processes.

So just to make sure I’m understanding correctly:


  *   JVM memory (set by SPARK_EXECUTOR_MEMORY and/or SPARK_WORKER_MEMORY?) is 
where the RDDs are stored. Currently both 

Re: Limit pyspark.daemon threads

2016-06-15 Thread agateaaa
Thx Gene! But my concern is with CPU usage, not memory. I want to see if there
is any way to control the number of pyspark.daemon processes that get spawned.
We have some restriction on the number of CPUs we can use on a node, and the
number of pyspark.daemon processes that get created doesn't seem to honor the
spark.executor.cores property setting.

Thanks!

On Wed, Jun 15, 2016 at 1:53 PM, Gene Pang  wrote:

> As Sven mentioned, you can use Alluxio to store RDDs in off-heap memory,
> and you can then share that RDD across different jobs. If you would like to
> run Spark on Alluxio, this documentation can help:
> http://www.alluxio.org/documentation/master/en/Running-Spark-on-Alluxio.html
>
> Thanks,
> Gene
>
> On Tue, Jun 14, 2016 at 12:44 AM, agateaaa  wrote:
>
>> Hi,
>>
>> I am seeing this issue too with pyspark (using Spark 1.6.1). I have set
>> spark.executor.cores to 1, but I see that whenever a streaming batch starts
>> processing data, the python -m pyspark.daemon processes increase gradually
>> to about 5 (increasing CPU% on a box about 4-5 times; each pyspark.daemon
>> takes up around 100% CPU).
>>
>> After the processing is done, 4 pyspark.daemon processes go away and we
>> are left with one till the next batch run. Also, sometimes the CPU usage
>> for the executor process spikes to about 800% even though
>> spark.executor.cores is set to 1.
>>
>> e.g. top output
>> PID USER  PR   NI  VIRT  RES  SHR S   %CPU %MEMTIME+  COMMAND
>> 19634 spark 20   0 8871420 1.790g  32056 S 814.1  2.9   0:39.33
>> /usr/lib/j+ <--EXECUTOR
>>
>> 13897 spark 20   0   46576  17916   6720 S   100.0  0.0   0:00.17
>> python -m + <--pyspark.daemon
>> 13991 spark 20   0   46524  15572   4124 S   98.0  0.0   0:08.18
>> python -m + <--pyspark.daemon
>> 14488 spark 20   0   46524  15636   4188 S   98.0  0.0   0:07.25
>> python -m + <--pyspark.daemon
>> 14514 spark 20   0   46524  15636   4188 S   94.0  0.0   0:06.72
>> python -m + <--pyspark.daemon
>> 14526 spark 20   0   48200  17172   4092 S   0.0  0.0   0:00.38
>> python -m + <--pyspark.daemon
>>
>>
>>
>> Is there any way to control the number of pyspark.daemon processes that
>> get spawned ?
>>
>> Thank you
>> Agateaaa
>>
>> On Sun, Mar 27, 2016 at 1:08 AM, Sven Krasser  wrote:
>>
>>> Hey Ken,
>>>
>>> 1. You're correct, cached RDDs live on the JVM heap. (There's an
>>> off-heap storage option using Alluxio, formerly Tachyon, with which I have
>>> no experience however.)
>>>
>>> 2. The worker memory setting is not a hard maximum unfortunately. What
>>> happens is that during aggregation the Python daemon will check its process
>>> size. If the size is larger than this setting, it will start spilling to
>>> disk. I've seen many occasions where my daemons grew larger. Also, you're
>>> relying on Python's memory management to free up space again once objects
>>> are evicted. In practice, leave this setting reasonably small but make sure
>>> there's enough free memory on the machine so you don't run into OOM
>>> conditions. If the lower memory setting causes strains for your users, make
>>> sure they increase the parallelism of their jobs (smaller partitions
>>> meaning less data is processed at a time).
>>>
>>> 3. I believe that is the behavior you can expect when setting
>>> spark.executor.cores. I've not experimented much with it and haven't looked
>>> at that part of the code, but what you describe also reflects my
>>> understanding. Please share your findings here, I'm sure those will be very
>>> helpful to others, too.
>>>
>>> One more suggestion for your users is to move to the Pyspark DataFrame
>>> API. Much of the processing will then happen in the JVM, and you will bump
>>> into fewer Python resource contention issues.
>>>
>>> Best,
>>> -Sven
>>>
>>>
>>> On Sat, Mar 26, 2016 at 1:38 PM, Carlile, Ken wrote:
>>>
 This is extremely helpful!

 I’ll have to talk to my users about how the python memory limit should
 be adjusted and what their expectations are. I’m fairly certain we bumped
 it up in the dark past when jobs were failing because of insufficient
 memory for the python processes.

 So just to make sure I’m understanding correctly:


- JVM memory (set by SPARK_EXECUTOR_MEMORY and/or SPARK_WORKER_MEMORY?) is
  where the RDDs are stored. Currently both of those values are set to 90GB.
- spark.python.worker.memory controls how much RAM each python task can take
  at maximum (roughly speaking). Currently set to 4GB.
- spark.task.cpus controls how many java worker threads will exist and thus
  indirectly how many pyspark daemon processes will exist.


 I’m also looking into fixing my cron jobs so they don’t stack up by
 implementing flock in the jobs and changing how teardowns of the spark
 cluster work as far as failed workers.

 Thanks again,

RE: restarting of spark streaming

2016-06-15 Thread Chen, Yan I
Could anyone answer my question?

_
From: Chen, Yan I
Sent: 2016, June, 14 1:34 PM
To: 'user@spark.apache.org'
Subject: restarting of spark streaming


Hi,

I notice that in the process of restarting, spark streaming will try to 
recover/replay all the batches it missed. But in this process, will streams be 
checkpointed like the way they are checkpointed in the normal process?

Does anyone know?

Sometimes our cluster goes maintenance, and our streaming process is shutdown 
for e.g. 1 day and restarted. If replaying batches in this period of time 
without checkpointing, the RDD chain will be very big, and memory usage will 
keep going up until all missing batches are replayed.

[memory usage will keep going up until all missing batches are replayed]: this 
is what we observe now.

Thanks,
Yan Chen



Re: concat spark dataframes

2016-06-15 Thread Natu Lauchande
Hi,

You can select the common columns and use DataFrame.unionAll.
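
For example, a minimal sketch (Spark 1.6 Scala API; df1, df2 and the column names are placeholders):

import org.apache.spark.sql.functions.col

val common = Seq("id", "name") // columns present in both frames
val combined = df1.select(common.map(col): _*)
  .unionAll(df2.select(common.map(col): _*))

If the goal is instead to put columns from both frames side by side, a join on a shared key (e.g. df1.join(df2, "id")) is the usual route.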

Regards,
Natu

On Wed, Jun 15, 2016 at 8:57 PM, spR  wrote:

> hi,
>
> how to concatenate spark dataframes? I have 2 frames with certain columns.
> I want to get a dataframe with columns from both the other frames.
>
> Regards,
> Misha
>


Re: Reporting warnings from workers

2016-06-15 Thread Ted Yu
Have you looked at:

https://spark.apache.org/docs/latest/programming-guide.html#accumulators
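
A minimal sketch of that approach (Scala; isValid and somefunction stand in for the poster's own logic):

val invalid = sc.accumulator(0L, "invalid records")
val newrdd = rdd.map { v =>
  if (!isValid(v)) invalid += 1L // counted on the workers
  somefunction(v)
}
newrdd.count() // accumulators are only populated once an action runs
println("invalid records: " + invalid.value)

To keep the bad records themselves rather than just a count, a common alternative is to tag each record and split the RDD with two filters.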

On Wed, Jun 15, 2016 at 1:24 PM, Mathieu Longtin 
wrote:

> Is there a way to report warnings from the workers back to the driver
> process?
>
> Let's say I have an RDD and do this:
>
> newrdd = rdd.map(somefunction)
>
> In *somefunction*, I want to catch when there are invalid values in *rdd *and
> either put them in another RDD or send some sort of message back.
>
> Is that possible?
> --
> Mathieu Longtin
> 1-514-803-8977
>


ERROR TaskResultGetter: Exception while getting task result java.io.IOException: java.lang.ClassNotFoundException: scala.Some

2016-06-15 Thread S Sarkar
Hello,

I built package for a spark application with the following sbt file:

name := "Simple Project"

version := "1.0"

scalaVersion := "2.10.3"

libraryDependencies ++= Seq(
  "org.apache.spark"  %% "spark-core"  % "1.4.0" % "provided",
  "org.apache.spark"  %% "spark-mllib" % "1.4.0",
  "org.apache.spark"  %% "spark-sql"   % "1.4.0",
  "org.apache.spark"  %% "spark-sql"   % "1.4.0"
  )
resolvers += "Akka Repository" at "http://repo.akka.io/releases/;

I am getting TaskResultGetter error with ClassNotFoundException for
scala.Some .

Can I please get some help on how to fix it?

Thanks,
S. Sarkar



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/ERROR-TaskResultGetter-Exception-while-getting-task-result-java-io-IOException-java-lang-ClassNotFoue-tp27178.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Reporting warnings from workers

2016-06-15 Thread Mathieu Longtin
Is there a way to report warnings from the workers back to the driver
process?

Let's say I have an RDD and do this:

newrdd = rdd.map(somefunction)

In *somefunction*, I want to catch when there are invalid values in *rdd *and
either put them in another RDD or send some sort of message back.

Is that possible?
-- 
Mathieu Longtin
1-514-803-8977


data too long

2016-06-15 Thread spR
I am trying to save a spark dataframe in the mysql database by using:

df.write(sql_url, table='db.table')

the first column in the dataframe seems too long and I get this error :

Data too long for column 'custid' at row 1


what should I do?


Thanks


Re: IllegalArgumentException UnsatisfiedLinkError snappy-1.1.2 spark-shell error

2016-06-15 Thread Arul Ramachandran
Hi Paolo,
Were you able to get this resolved? I am hitting this issue; can you please
share what your solution was?
Thanks

On Mon, Feb 15, 2016 at 7:49 PM, Paolo Villaflores 
wrote:

>
> Yes, I have seen that. But java.io.tmpdir has a default definition in
> Linux--it is /tmp.
>
>
>
> On Tue, Feb 16, 2016 at 2:17 PM, Ted Yu  wrote:
>
>> Have you seen this thread ?
>>
>>
>> http://search-hadoop.com/m/q3RTtW43zT1e2nfb=Re+ibsnappyjava+so+failed+to+map+segment+from+shared+object
>>
>> On Mon, Feb 15, 2016 at 7:09 PM, Paolo Villaflores <
>> pbvillaflo...@gmail.com> wrote:
>>
>>>
>>> Hi,
>>>
>>>
>>>
>>> I am trying to run spark 1.6.0.
>>>
>>> I have previously just installed a fresh instance of hadoop 2.6.0 and
>>> hive 0.14.
>>>
>>> Hadoop, mapreduce, hive and beeline are working.
>>>
>>> However, as soon as I run `sc.textfile()` within spark-shell, it returns
>>> an error:
>>>
>>>
>>> $ spark-shell
>>> Welcome to
>>>       ____              __
>>>      / __/__  ___ _____/ /__
>>>     _\ \/ _ \/ _ `/ __/  '_/
>>>    /___/ .__/\_,_/_/ /_/\_\   version 1.6.0
>>>       /_/
>>>
>>> Using Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java
>>> 1.7.0_67)
>>> Type in expressions to have them evaluated.
>>> Type :help for more information.
>>> Spark context available as sc.
>>> SQL context available as sqlContext.
>>>
>>> scala> val textFile = sc.textFile("README.md")
>>> java.lang.IllegalArgumentException: java.lang.UnsatisfiedLinkError:
>>> /tmp/snappy-1.1.2-2ccaf764-c7c4-4ff1-a68e-bbfdec0a3aa1-libsnappyjava.so:
>>> /tmp/snappy-1.1.2-2ccaf764-c7c4-4ff1-a68e-bbfdec0a3aa1-libsnappyjava.so:
>>> failed to map segment from shared object: Operation not permitted
>>> at
>>> org.apache.spark.io.SnappyCompressionCodec.<init>(CompressionCodec.scala:156)
>>> at
>>> sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>>> at
>>> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>>> at
>>> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>>> at
>>> java.lang.reflect.Constructor.newInstance(Constructor.java:526)
>>> at
>>> org.apache.spark.io.CompressionCodec$.createCodec(CompressionCodec.scala:72)
>>> at
>>> org.apache.spark.io.CompressionCodec$.createCodec(CompressionCodec.scala:65)
>>> at org.apache.spark.broadcast.TorrentBroadcast.org
>>> $apache$spark$broadcast$TorrentBroadcast$$setConf(TorrentBroadcast.scala:73)
>>> at
>>> org.apache.spark.broadcast.TorrentBroadcast.<init>(TorrentBroadcast.scala:80)
>>> at
>>> org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34)
>>> at
>>> org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:63)
>>> at
>>> org.apache.spark.SparkContext.broadcast(SparkContext.scala:1326)
>>> at
>>> org.apache.spark.SparkContext$$anonfun$hadoopFile$1.apply(SparkContext.scala:1014)
>>> at
>>> org.apache.spark.SparkContext$$anonfun$hadoopFile$1.apply(SparkContext.scala:1011)
>>> at
>>> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
>>> at
>>> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
>>> at
>>> org.apache.spark.SparkContext.withScope(SparkContext.scala:714)
>>> at
>>> org.apache.spark.SparkContext.hadoopFile(SparkContext.scala:1011)
>>> at
>>> org.apache.spark.SparkContext$$anonfun$textFile$1.apply(SparkContext.scala:832)
>>> at
>>> org.apache.spark.SparkContext$$anonfun$textFile$1.apply(SparkContext.scala:830)
>>> at
>>> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
>>> at
>>> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
>>> at
>>> org.apache.spark.SparkContext.withScope(SparkContext.scala:714)
>>> at
>>> org.apache.spark.SparkContext.textFile(SparkContext.scala:830)
>>> at
>>> $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:27)
>>> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:32)
>>> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:34)
>>> at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:36)
>>> at $iwC$$iwC$$iwC$$iwC.<init>(<console>:38)
>>> at $iwC$$iwC$$iwC.<init>(<console>:40)
>>> at $iwC$$iwC.<init>(<console>:42)
>>> at $iwC.<init>(<console>:44)
>>> at <init>(<console>:46)
>>> at .<init>(<console>:50)
>>> at .<clinit>(<console>)
>>> at .<init>(<console>:7)
>>> at .<clinit>(<console>)
>>> at $print()
>>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native
>>> Method)
>>> at
>>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>> at
>>> 

concat spark dataframes

2016-06-15 Thread spR
hi,

how to concatenate spark dataframes? I have 2 frames with certain columns.
I want to get a dataframe with columns from both the other frames.

Regards,
Misha


Re: spark standalone High availability issues

2016-06-15 Thread dhruve ashar
NoMethodFound suggests that you are using incompatible versions of jars.

Check your dependencies, they might be outdated. Updating the version or
getting the right ones usually solves this issue.
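
With sbt, one way to pin a single consistent version is dependencyOverrides (a sketch; the Curator artifacts and the version are guesses, so check what your Spark build actually pulls in):

// build.sbt
dependencyOverrides ++= Set(
  "org.apache.curator" % "curator-framework" % "2.6.0",
  "org.apache.curator" % "curator-recipes" % "2.6.0"
)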


On Wed, Jun 15, 2016 at 9:04 AM, Jacek Laskowski  wrote:

> Can you post the error?
>
> Jacek
> On 14 Jun 2016 10:56 p.m., "Darshan Singh"  wrote:
>
>> Hi,
>>
>> I am using a standalone spark cluster and a zookeeper cluster for
>> high availability. I sometimes get an error when I start the master. The
>> error is related to leader election in Curator and says that no method
>> (getProcess) was found, and the master doesn't get started.
>>
>> Just wondering what could be causing the issue.
>>
>> I am using same zookeeper cluster for HDFS High availability and it is
>> working just fine.
>>
>>
>> Thanks
>>
>


-- 
-Dhruve Ashar


Re: How to insert data into 2000 partitions(directories) of ORC/parquet at a time using Spark SQL?

2016-06-15 Thread swetha kasireddy
Hi Mich,

No, I have not tried that. My requirement is to insert the data from an hourly
Spark batch job. How is it different from inserting with the Hive CLI or
beeline?

Thanks,
Swetha



On Tue, Jun 14, 2016 at 10:44 AM, Mich Talebzadeh  wrote:

> Hi Swetha,
>
> Have you actually tried doing this in Hive using Hive CLI or beeline?
>
> Thanks
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 14 June 2016 at 18:43, Mich Talebzadeh 
> wrote:
>
>> In all probability there is no user database created in Hive
>>
>> Create a database yourself
>>
>> sql("create if not exists database test")
>>
>> It would be helpful to grasp some concepts of Hive databases.
>>
>> HTH
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 14 June 2016 at 15:40, Sree Eedupuganti  wrote:
>>
>>> Hi Spark users, I am new to Spark. I am trying to connect to Hive using
>>> SparkJavaContext but am unable to connect to the database. By executing the
>>> below code I can see only the "default" database. Can anyone help me out?
>>> What I need is a sample program for querying Hive results using
>>> SparkJavaContext, and I need to pass values like this:
>>>
>>> userDF.registerTempTable("userRecordsTemp")
>>>
>>> sqlContext.sql("SET hive.default.fileformat=Orc  ")
>>> sqlContext.sql("set hive.enforce.bucketing = true; ")
>>> sqlContext.sql("set hive.enforce.sorting = true; ")
>>>
>>>  public static void  main(String[] args ) throws Exception {
>>>   SparkConf sparkConf = new
>>> SparkConf().setAppName("SparkSQL").setMaster("local");
>>>   SparkContext  ctx=new SparkContext(sparkConf);
>>>   HiveContext  hiveql=new
>>> org.apache.spark.sql.hive.HiveContext(ctx);
>>>   DataFrame df=hiveql.sql("show databases");
>>>   df.show();
>>>   }
>>>
>>> Any suggestions please? Thanks.
>>>
>>
>>
>


Re: processing 50 gb data using just one machine

2016-06-15 Thread spR
Thanks! Got that. I was worried about the time itself.

On Wed, Jun 15, 2016 at 10:10 AM, Sergio Fernández 
wrote:

> In theory yes... the common sense say that:
>
> volume / resources = time
>
> So more volume on the same processing resources would just take more time.
> On Jun 15, 2016 6:43 PM, "spR"  wrote:
>
>> I have 16 gb ram, i7
>>
>> Will this config be able to handle the processing without my ipython
>> notebook dying?
>>
>> The local mode is for testing purpose. But, I do not have any cluster at
>> my disposal. So can I make this work with the configuration that I have?
>> Thank you.
>> On Jun 15, 2016 9:40 AM, "Deepak Goel"  wrote:
>>
>>> What do you mean by "EFFICIENTLY"?
>>>
>>> Hey
>>>
>>> Namaskara~Nalama~Guten Tag~Bonjour
>>>
>>>
>>>--
>>> Keigu
>>>
>>> Deepak
>>> 73500 12833
>>> www.simtree.net, dee...@simtree.net
>>> deic...@gmail.com
>>>
>>> LinkedIn: www.linkedin.com/in/deicool
>>> Skype: thumsupdeicool
>>> Google talk: deicool
>>> Blog: http://loveandfearless.wordpress.com
>>> Facebook: http://www.facebook.com/deicool
>>>
>>> "Contribute to the world, environment and more :
>>> http://www.gridrepublic.org
>>> "
>>>
>>> On Wed, Jun 15, 2016 at 9:33 PM, spR  wrote:
>>>
 Hi,

 can I use spark in local mode using 4 cores to process 50gb data
 efficiently?

 Thank you

 misha

>>>
>>>


Re: processing 50 gb data using just one machine

2016-06-15 Thread spR
I meant that local mode is generally for testing purposes. But I have to use
the entire 50gb of data.

On Wed, Jun 15, 2016 at 10:14 AM, Deepak Goel  wrote:

> If it is just for test purpose, why not use a smaller size of data and
> test it on your notebook. When you go for the cluster, you can go for 50GB
> (I am a newbie so my thought would be very naive)
>
> Hey
>
> Namaskara~Nalama~Guten Tag~Bonjour
>
>
>--
> Keigu
>
> Deepak
> 73500 12833
> www.simtree.net, dee...@simtree.net
> deic...@gmail.com
>
> LinkedIn: www.linkedin.com/in/deicool
> Skype: thumsupdeicool
> Google talk: deicool
> Blog: http://loveandfearless.wordpress.com
> Facebook: http://www.facebook.com/deicool
>
> "Contribute to the world, environment and more :
> http://www.gridrepublic.org
> "
>
> On Wed, Jun 15, 2016 at 10:40 PM, Sergio Fernández 
> wrote:
>
>> In theory yes... the common sense say that:
>>
>> volume / resources = time
>>
>> So more volume on the same processing resources would just take more time.
>> On Jun 15, 2016 6:43 PM, "spR"  wrote:
>>
>>> I have 16 gb ram, i7
>>>
>>> Will this config be able to handle the processing without my ipython
>>> notebook dying?
>>>
>>> The local mode is for testing purpose. But, I do not have any cluster at
>>> my disposal. So can I make this work with the configuration that I have?
>>> Thank you.
>>> On Jun 15, 2016 9:40 AM, "Deepak Goel"  wrote:
>>>
 What do you mean by "EFFICIENTLY"?

 Hey

 Namaskara~Nalama~Guten Tag~Bonjour


--
 Keigu

 Deepak
 73500 12833
 www.simtree.net, dee...@simtree.net
 deic...@gmail.com

 LinkedIn: www.linkedin.com/in/deicool
 Skype: thumsupdeicool
 Google talk: deicool
 Blog: http://loveandfearless.wordpress.com
 Facebook: http://www.facebook.com/deicool

 "Contribute to the world, environment and more :
 http://www.gridrepublic.org
 "

 On Wed, Jun 15, 2016 at 9:33 PM, spR  wrote:

> Hi,
>
> can I use spark in local mode using 4 cores to process 50gb data
> efficiently?
>
> Thank you
>
> misha
>


>


Re: processing 50 gb data using just one machine

2016-06-15 Thread Deepak Goel
If it is just for testing purposes, why not use a smaller dataset and test
it on your notebook? When you go to the cluster, you can go for 50GB (I am
a newbie, so my thought may be naive).

Hey

Namaskara~Nalama~Guten Tag~Bonjour


   --
Keigu

Deepak
73500 12833
www.simtree.net, dee...@simtree.net
deic...@gmail.com

LinkedIn: www.linkedin.com/in/deicool
Skype: thumsupdeicool
Google talk: deicool
Blog: http://loveandfearless.wordpress.com
Facebook: http://www.facebook.com/deicool

"Contribute to the world, environment and more : http://www.gridrepublic.org
"

On Wed, Jun 15, 2016 at 10:40 PM, Sergio Fernández 
wrote:

> In theory yes... the common sense say that:
>
> volume / resources = time
>
> So more volume on the same processing resources would just take more time.
> On Jun 15, 2016 6:43 PM, "spR"  wrote:
>
>> I have 16 gb ram, i7
>>
>> Will this config be able to handle the processing without my ipython
>> notebook dying?
>>
>> The local mode is for testing purpose. But, I do not have any cluster at
>> my disposal. So can I make this work with the configuration that I have?
>> Thank you.
>> On Jun 15, 2016 9:40 AM, "Deepak Goel"  wrote:
>>
>>> What do you mean by "EFFICIENTLY"?
>>>
>>> Hey
>>>
>>> Namaskara~Nalama~Guten Tag~Bonjour
>>>
>>>
>>>--
>>> Keigu
>>>
>>> Deepak
>>> 73500 12833
>>> www.simtree.net, dee...@simtree.net
>>> deic...@gmail.com
>>>
>>> LinkedIn: www.linkedin.com/in/deicool
>>> Skype: thumsupdeicool
>>> Google talk: deicool
>>> Blog: http://loveandfearless.wordpress.com
>>> Facebook: http://www.facebook.com/deicool
>>>
>>> "Contribute to the world, environment and more :
>>> http://www.gridrepublic.org
>>> "
>>>
>>> On Wed, Jun 15, 2016 at 9:33 PM, spR  wrote:
>>>
 Hi,

 can I use spark in local mode using 4 cores to process 50gb data
> efficiently?

 Thank you

 misha

>>>
>>>


Fwd: ERROR RetryingBlockFetcher: Exception while beginning fetch of 1 outstanding blocks

2016-06-15 Thread VG
>
> I have a very simple driver which loads a textFile and filters a
> sub-string from each line in the textfile.
> When the collect action is executed , I am getting an exception.   (The
> file is only 90 MB - so I am confused what is going on..) I am running on a
> local standalone cluster
>
> 16/06/15 19:45:22 INFO BlockManagerInfo: Removed broadcast_2_piece0 on
> 192.168.56.1:56413 in memory (size: 2.5 KB, free: 2.4 GB)
> 16/06/15 19:45:22 INFO BlockManagerInfo: Removed broadcast_1_piece0 on
> 192.168.56.1:56413 in memory (size: 1900.0 B, free: 2.4 GB)
> 16/06/15 19:45:22 INFO BlockManagerInfo: Added rdd_2_1 on disk on
> 192.168.56.1:56413 (size: 2.7 MB)
> 16/06/15 19:45:22 INFO MemoryStore: Block taskresult_7 stored as bytes in
> memory (estimated size 2.7 MB, free 2.4 GB)
> 16/06/15 19:45:22 INFO BlockManagerInfo: Added taskresult_7 in memory on
> 192.168.56.1:56413 (size: 2.7 MB, free: 2.4 GB)
> 16/06/15 19:45:22 INFO Executor: Finished task 1.0 in stage 2.0 (TID 7).
> 2823777 bytes result sent via BlockManager)
> 16/06/15 19:45:22 INFO TaskSetManager: Starting task 2.0 in stage 2.0 (TID
> 8, localhost, partition 2, PROCESS_LOCAL, 5422 bytes)
> 16/06/15 19:45:22 INFO Executor: Running task 2.0 in stage 2.0 (TID 8)
> 16/06/15 19:45:22 INFO HadoopRDD: Input split:
> file:/C:/Users/i303551/Downloads/ariba-logs/ssws/access.2016.04.26/access.2016.04.26:67108864+25111592
> 16/06/15 19:45:22 INFO BlockManagerInfo: Added rdd_2_2 on disk on
> 192.168.56.1:56413 (size: 2.0 MB)
> 16/06/15 19:45:22 INFO MemoryStore: Block taskresult_8 stored as bytes in
> memory (estimated size 2.0 MB, free 2.4 GB)
> 16/06/15 19:45:22 INFO BlockManagerInfo: Added taskresult_8 in memory on
> 192.168.56.1:56413 (size: 2.0 MB, free: 2.4 GB)
> 16/06/15 19:45:22 INFO Executor: Finished task 2.0 in stage 2.0 (TID 8).
> 2143771 bytes result sent via BlockManager)
> 16/06/15 19:45:43 ERROR RetryingBlockFetcher: Exception while beginning
> fetch of 1 outstanding blocks
> java.io.IOException: Failed to connect to /192.168.56.1:56413
> at
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:228)
> at
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:179)
> at
> org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:96)
> at
> org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
> at
> org.apache.spark.network.shuffle.RetryingBlockFetcher.start(RetryingBlockFetcher.java:120)
> at
> org.apache.spark.network.netty.NettyBlockTransferService.fetchBlocks(NettyBlockTransferService.scala:105)
> at
> org.apache.spark.network.BlockTransferService.fetchBlockSync(BlockTransferService.scala:92)
> at
> org.apache.spark.storage.BlockManager.getRemoteBytes(BlockManager.scala:546)
> at
> org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:76)
> at
> org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:57)
> at
> org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:57)
> at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1793)
> at
> org.apache.spark.scheduler.TaskResultGetter$$anon$2.run(TaskResultGetter.scala:56)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
> at java.lang.Thread.run(Unknown Source)
> Caused by: java.net.ConnectException: Connection timed out: no further
> information: /192.168.56.1:56413
> at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
> at sun.nio.ch.SocketChannelImpl.finishConnect(Unknown Source)
> at
> io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:224)
> at
> io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:289)
> at
> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528)
> at
> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
> at
> io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
> at
> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
> ... 1 more
> 16/06/15 19:45:43 INFO RetryingBlockFetcher: Retrying fetch (1/3) for 1
> outstanding blocks after 5000 ms
> 16/06/15 19:46:04 ERROR RetryingBlockFetcher: Exception while beginning
> fetch of 1 outstanding blocks
> java.io.IOException: Failed to connect to /192.168.56.1:56413
> at
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:228)
> at
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:179)
> at
> 

Re: processing 50 gb data using just one machine

2016-06-15 Thread Sergio Fernández
In theory yes... the common sense say that:

volume / resources = time

So more volume on the same processing resources would just take more time.
On Jun 15, 2016 6:43 PM, "spR"  wrote:

> I have 16 gb ram, i7
>
> Will this config be able to handle the processing without my ipython
> notebook dying?
>
> The local mode is for testing purpose. But, I do not have any cluster at
> my disposal. So can I make this work with the configuration that I have?
> Thank you.
> On Jun 15, 2016 9:40 AM, "Deepak Goel"  wrote:
>
>> What do you mean by "EFFICIENTLY"?
>>
>> Hey
>>
>> Namaskara~Nalama~Guten Tag~Bonjour
>>
>>
>>--
>> Keigu
>>
>> Deepak
>> 73500 12833
>> www.simtree.net, dee...@simtree.net
>> deic...@gmail.com
>>
>> LinkedIn: www.linkedin.com/in/deicool
>> Skype: thumsupdeicool
>> Google talk: deicool
>> Blog: http://loveandfearless.wordpress.com
>> Facebook: http://www.facebook.com/deicool
>>
>> "Contribute to the world, environment and more :
>> http://www.gridrepublic.org
>> "
>>
>> On Wed, Jun 15, 2016 at 9:33 PM, spR  wrote:
>>
>>> Hi,
>>>
>>> can I use spark in local mode using 4 cores to process 50gb data
>>> efficiently?
>>>
>>> Thank you
>>>
>>> misha
>>>
>>
>>


Re: processing 50 gb data using just one machine

2016-06-15 Thread spR
I have 16 gb ram, i7

Will this config be able to handle the processing without my ipython
notebook dying?

The local mode is for testing purpose. But, I do not have any cluster at my
disposal. So can I make this work with the configuration that I have? Thank
you.
On Jun 15, 2016 9:40 AM, "Deepak Goel"  wrote:

> What do you mean by "EFFICIENTLY"?
>
> Hey
>
> Namaskara~Nalama~Guten Tag~Bonjour
>
>
>--
> Keigu
>
> Deepak
> 73500 12833
> www.simtree.net, dee...@simtree.net
> deic...@gmail.com
>
> LinkedIn: www.linkedin.com/in/deicool
> Skype: thumsupdeicool
> Google talk: deicool
> Blog: http://loveandfearless.wordpress.com
> Facebook: http://www.facebook.com/deicool
>
> "Contribute to the world, environment and more :
> http://www.gridrepublic.org
> "
>
> On Wed, Jun 15, 2016 at 9:33 PM, spR  wrote:
>
>> Hi,
>>
>> can I use spark in local mode using 4 cores to process 50gb data
>> efficiently?
>>
>> Thank you
>>
>> misha
>>
>
>


Re: choice of RDD function

2016-06-15 Thread Sivakumaran S
Of course :)

import kafka.serializer.StringDecoder
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.examples.streaming.StreamingExamples // from the spark-examples module

object sparkStreaming {
  def main(args: Array[String]) {
    StreamingExamples.setStreamingLogLevels() // Set reasonable logging levels for streaming if the user has not configured log4j.
    val topics = "test"
    val brokers = "localhost:9092"
    val topicsSet = topics.split(",").toSet
    val sparkConf = new SparkConf().setAppName("KafkaDroneCalc").setMaster("local") // spark://localhost:7077
    val sc = new SparkContext(sparkConf)
    val ssc = new StreamingContext(sc, Seconds(30))
    val kafkaParams = Map[String, String]("metadata.broker.list" -> brokers)
    val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topicsSet)
    val lines = messages.map(_._2) // the JSON payload of each Kafka message
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    lines.foreachRDD { rdd =>
      val df = sqlContext.read.json(rdd)
      df.registerTempTable("drone")
      sqlContext.sql("SELECT id, AVG(temp), AVG(rotor_rpm), AVG(winddirection), AVG(windspeed) FROM drone GROUP BY id").show()
    }
    ssc.start()
    ssc.awaitTermination()
  }
}
I haven’t checked long running performance though. 

Regards,

Siva

> On 15-Jun-2016, at 5:02 PM, Jacek Laskowski  wrote:
> 
> Hi,
> 
> Good to hear so! Mind sharing a few snippets of your solution?
> 
> Pozdrawiam,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark http://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
> 
> 
> On Wed, Jun 15, 2016 at 5:03 PM, Sivakumaran S  wrote:
>> Thanks Jacek,
>> 
>> Job completed!! :) Just used data frames and sql query. Very clean and
>> functional code.
>> 
>> Siva
>> 
>> On 15-Jun-2016, at 3:10 PM, Jacek Laskowski  wrote:
>> 
>> mapWithState
>> 
>> 



Re: update mysql in spark

2016-06-15 Thread Cheng Lian
Spark SQL doesn't support update command yet.
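
A common workaround is to express the update as a transformation and write the result back. A Scala sketch (the pyspark API is analogous; the table and column names are taken from the question):

import org.apache.spark.sql.functions.{col, round}

val updated = sqlContext.table("act1").withColumn("loc", round(col("loc"), 4))
updated.registerTempTable("act1_rounded") // or write the result back to the source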

On Wed, Jun 15, 2016, 9:08 AM spR  wrote:

> hi,
>
> can we write a update query using sqlcontext?
>
> sqlContext.sql("update act1 set loc = round(loc,4)")
>
> what is wrong in this? I get the following error.
>
> Py4JJavaError: An error occurred while calling o20.sql.
> : java.lang.RuntimeException: [1.1] failure: ``with'' expected but identifier 
> update found
>
> update act1 set loc = round(loc,4)
> ^
>   at scala.sys.package$.error(package.scala:27)
>   at 
> org.apache.spark.sql.catalyst.AbstractSparkSQLParser.parse(AbstractSparkSQLParser.scala:36)
>   at 
> org.apache.spark.sql.catalyst.DefaultParserDialect.parse(ParserDialect.scala:67)
>   at 
> org.apache.spark.sql.SQLContext$$anonfun$2.apply(SQLContext.scala:211)
>   at 
> org.apache.spark.sql.SQLContext$$anonfun$2.apply(SQLContext.scala:211)
>   at 
> org.apache.spark.sql.execution.SparkSQLParser$$anonfun$org$apache$spark$sql$execution$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:114)
>   at 
> org.apache.spark.sql.execution.SparkSQLParser$$anonfun$org$apache$spark$sql$execution$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:113)
>   at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136)
>   at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135)
>   at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
>   at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
>   at 
> scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
>   at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
>   at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
>   at 
> scala.util.parsing.combinator.Parsers$Failure.append(Parsers.scala:202)
>   at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
>   at 
> scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
>   at 
> scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
>   at 
> scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891)
>   at 
> scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891)
>   at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
>   at 
> scala.util.parsing.combinator.Parsers$$anon$2.apply(Parsers.scala:890)
>   at 
> scala.util.parsing.combinator.PackratParsers$$anon$1.apply(PackratParsers.scala:110)
>   at 
> org.apache.spark.sql.catalyst.AbstractSparkSQLParser.parse(AbstractSparkSQLParser.scala:34)
>   at 
> org.apache.spark.sql.SQLContext$$anonfun$1.apply(SQLContext.scala:208)
>   at 
> org.apache.spark.sql.SQLContext$$anonfun$1.apply(SQLContext.scala:208)
>   at 
> org.apache.spark.sql.execution.datasources.DDLParser.parse(DDLParser.scala:43)
>   at org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:231)
>   at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:817)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
>   at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
>   at py4j.Gateway.invoke(Gateway.java:259)
>   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
>   at py4j.commands.CallCommand.execute(CallCommand.java:79)
>   at py4j.GatewayConnection.run(GatewayConnection.java:209)
>   at java.lang.Thread.run(Thread.java:745)
>
>
>
>


update mysql in spark

2016-06-15 Thread spR
hi,

can we write a update query using sqlcontext?

sqlContext.sql("update act1 set loc = round(loc,4)")

what is wrong in this? I get the following error.

Py4JJavaError: An error occurred while calling o20.sql.
: java.lang.RuntimeException: [1.1] failure: ``with'' expected but
identifier update found

update act1 set loc = round(loc,4)
^
at scala.sys.package$.error(package.scala:27)
at 
org.apache.spark.sql.catalyst.AbstractSparkSQLParser.parse(AbstractSparkSQLParser.scala:36)
at 
org.apache.spark.sql.catalyst.DefaultParserDialect.parse(ParserDialect.scala:67)
at 
org.apache.spark.sql.SQLContext$$anonfun$2.apply(SQLContext.scala:211)
at 
org.apache.spark.sql.SQLContext$$anonfun$2.apply(SQLContext.scala:211)
at 
org.apache.spark.sql.execution.SparkSQLParser$$anonfun$org$apache$spark$sql$execution$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:114)
at 
org.apache.spark.sql.execution.SparkSQLParser$$anonfun$org$apache$spark$sql$execution$SparkSQLParser$$others$1.apply(SparkSQLParser.scala:113)
at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:136)
at scala.util.parsing.combinator.Parsers$Success.map(Parsers.scala:135)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$map$1.apply(Parsers.scala:242)
at 
scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1$$anonfun$apply$2.apply(Parsers.scala:254)
at 
scala.util.parsing.combinator.Parsers$Failure.append(Parsers.scala:202)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
at 
scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
at 
scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
at 
scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891)
at 
scala.util.parsing.combinator.Parsers$$anon$2$$anonfun$apply$14.apply(Parsers.scala:891)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
at 
scala.util.parsing.combinator.Parsers$$anon$2.apply(Parsers.scala:890)
at 
scala.util.parsing.combinator.PackratParsers$$anon$1.apply(PackratParsers.scala:110)
at 
org.apache.spark.sql.catalyst.AbstractSparkSQLParser.parse(AbstractSparkSQLParser.scala:34)
at 
org.apache.spark.sql.SQLContext$$anonfun$1.apply(SQLContext.scala:208)
at 
org.apache.spark.sql.SQLContext$$anonfun$1.apply(SQLContext.scala:208)
at 
org.apache.spark.sql.execution.datasources.DDLParser.parse(DDLParser.scala:43)
at org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:231)
at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:817)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:209)
at java.lang.Thread.run(Thread.java:745)


Re: choice of RDD function

2016-06-15 Thread Jacek Laskowski
Hi,

Good to hear so! Mind sharing a few snippets of your solution?

Pozdrawiam,
Jacek Laskowski

https://medium.com/@jaceklaskowski/
Mastering Apache Spark http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski


On Wed, Jun 15, 2016 at 5:03 PM, Sivakumaran S  wrote:
> Thanks Jacek,
>
> Job completed!! :) Just used data frames and sql query. Very clean and
> functional code.
>
> Siva
>
> On 15-Jun-2016, at 3:10 PM, Jacek Laskowski  wrote:
>
> mapWithState
>
>




processing 50 gb data using just one machine

2016-06-15 Thread spR
Hi,

can I use spark in local mode using 4 cores to process 50gb data
efficiently?

Thank you

misha


Re: Is that normal spark performance?

2016-06-15 Thread Deepak Goel
I am not an expert, but it seems all your processing is done on node1 while
node2 is lying idle

Hey

Namaskara~Nalama~Guten Tag~Bonjour


   --
Keigu

Deepak
73500 12833
www.simtree.net, dee...@simtree.net
deic...@gmail.com

LinkedIn: www.linkedin.com/in/deicool
Skype: thumsupdeicool
Google talk: deicool
Blog: http://loveandfearless.wordpress.com
Facebook: http://www.facebook.com/deicool

"Contribute to the world, environment and more : http://www.gridrepublic.org
"

On Wed, Jun 15, 2016 at 7:35 PM, Jörn Franke  wrote:

>
> What volume do you have? Why do you not use the corresponding Cassandra
> functionality directly?
> If you do it once, and not iteratively in memory, you cannot expect much
> improvement.
>
> On 15 Jun 2016, at 16:01, nikita.dobryukha  wrote:
>
> We use Cassandra 3.5 + Spark 1.6.1 in 2-node cluster (8 cores and 1g
> memory per node). There is the following Cassandra table
>
> CREATE TABLE schema.trade (
> symbol text,
> date int,
> trade_time timestamp,
> reporting_venue text,
> trade_id bigint,
> ref_trade_id bigint,
> action_type text,
> price double,
> quantity int,
> condition_code text,
> PRIMARY KEY ((symbol, date), trade_time, trade_id)
> ) WITH compaction = {'class' : 'DateTieredCompactionStrategy', 
> 'base_time_seconds' : '3600', 'max_sstable_age_days' : '93'};
>
> And I want to calculate percentage of volume: sum of all volume from
> trades in the relevant security during the time period groupped by exchange
> and time bar (1 or 5 minutes). I've created an example:
>
> void getPercentageOfVolume(String symbol, Integer date, Timestamp timeFrom, 
> Timestamp timeTill, Integer barWidth) {
> char MILLISECOND_TO_MINUTE_MULTIPLIER = 60_000;
> LOG.info("start");
> JavaPairRDD<Tuple2<Timestamp, String>, Integer> counts =
> javaFunctions(sparkContext).cassandraTable("schema", "trade")
> .filter(row ->
> row.getString("symbol").equals(symbol) && 
> row.getInt("date").equals(date) &&
> row.getDateTime("trade_time").getMillis() >= 
> timeFrom.getTime() &&
> row.getDateTime("trade_time").getMillis() < 
> timeTill.getTime())
> .mapToPair(row ->
> new Tuple2<>(
> new Tuple2<Timestamp, String>(
> new Timestamp(
> 
> (row.getDateTime("trade_time").getMillis() / (barWidth * 
> MILLISECOND_TO_MINUTE_MULTIPLIER)) * barWidth * 
> MILLISECOND_TO_MINUTE_MULTIPLIER
> ),
> row.getString("reporting_venue")),
> row.getInt("quantity")
> )
> ).reduceByKey((a, b) -> a + b);
> LOG.info(counts.collect().toString());
> LOG.info("finish");
> }
>
> ...
> [2016-06-15 09:25:27.014] [INFO ] [main] [EquityTCAAnalytics] start
> [2016-06-15 09:25:28.000] [INFO ] [main] [NettyUtil] Found Netty's native 
> epoll transport in the classpath, using it
> [2016-06-15 09:25:28.518] [INFO ] [main] [Cluster] New Cassandra host 
> /node1:9042 added
> [2016-06-15 09:25:28.519] [INFO ] [main] [LocalNodeFirstLoadBalancingPolicy] 
> Added host node1 (datacenter1)
> [2016-06-15 09:25:28.519] [INFO ] [main] [Cluster] New Cassandra host 
> /node2:9042 added
> [2016-06-15 09:25:28.520] [INFO ] [main] [CassandraConnector] Connected to 
> Cassandra cluster: Cassandra
> [2016-06-15 09:25:29.115] [INFO ] [main] [SparkContext] Starting job: collect 
> at EquityTCAAnalytics.java:88
> [2016-06-15 09:25:29.385] [INFO ] [dag-scheduler-event-loop] [DAGScheduler] 
> Registering RDD 2 (mapToPair at EquityTCAAnalytics.java:78)
> [2016-06-15 09:25:29.388] [INFO ] [dag-scheduler-event-loop] [DAGScheduler] 
> Got job 0 (collect at EquityTCAAnalytics.java:88) with 5 output partitions
> [2016-06-15 09:25:29.388] [INFO ] [dag-scheduler-event-loop] [DAGScheduler] 
> Final stage: ResultStage 1 (collect at EquityTCAAnalytics.java:88)
> [2016-06-15 09:25:29.389] [INFO ] [dag-scheduler-event-loop] [DAGScheduler] 
> Parents of final stage: List(ShuffleMapStage 0)
> [2016-06-15 09:25:29.391] [INFO ] [dag-scheduler-event-loop] [DAGScheduler] 
> Missing parents: List(ShuffleMapStage 0)
> [2016-06-15 09:25:29.400] [INFO ] [dag-scheduler-event-loop] [DAGScheduler] 
> Submitting ShuffleMapStage 0 (MapPartitionsRDD[2] at mapToPair at 
> EquityTCAAnalytics.java:78), which has no missing parents
> [2016-06-15 09:25:29.594] [INFO ] [dag-scheduler-event-loop] [MemoryStore] 
> Block broadcast_0 stored as values in memory (estimated size 10.8 KB, free 
> 10.8 KB)
> [2016-06-15 09:25:29.642] [INFO ] [dag-scheduler-event-loop] [MemoryStore] 
> Block broadcast_0_piece0 stored as bytes in memory (estimated size 5.4 KB, 
> free 16.3 KB)
> [2016-06-15 09:25:29.647] [INFO ] [dispatcher-event-loop-7] 
> [BlockManagerInfo] Added broadcast_0_piece0 in memory on node2:44871 (size: 
> 5.4 KB, free: 2.4 GB)

vectors inside columns

2016-06-15 Thread pseudo oduesp
Hi,
I want to ask a question about dense or sparse vectors:

Imagine I have a dataframe with several columns and one of them contains vectors.

My question: can I give this column to machine learning algorithms as one
value?


df.col1 | df.col2 |
1 | (1,[2],[3] ,[] ...[6])
2 | (1,[5],[3] ,[] ...[9])
3 | (1,[5],[3] ,[] ...[10])

thanks


Re: choice of RDD function

2016-06-15 Thread Sivakumaran S
Thanks Jacek,

Job completed!! :) Just used data frames and sql query. Very clean and 
functional code.

Siva

> On 15-Jun-2016, at 3:10 PM, Jacek Laskowski  wrote:
> 
> mapWithState



Re: Spark 2.0 release date

2016-06-15 Thread andy petrella
Yeah well... the prior was high... but don't have enough data on Mich to
have an accurate likelihood :-)
But ok, my bad, I continue with the preview stuff and leave this thread in
peace ^^
tx ted
cheers

On Wed, Jun 15, 2016 at 4:47 PM Ted Yu  wrote:

> Andy:
> You should sense the tone in Mich's response.
>
> To my knowledge, there hasn't been an RC for the 2.0 release yet.
> Once we have an RC, it goes through the normal voting process.
>
> FYI
>
> On Wed, Jun 15, 2016 at 7:38 AM, andy petrella 
> wrote:
>
>> > tomorrow lunch time
>> Which TZ :-) → I'm working on the update of some materials that Dean
>> Wampler and myself will give tomorrow at Scala Days
>> (well tomorrow CEST).
>>
>> Hence, I'm upgrading the materials on spark 2.0.0-preview, do you think
>> 2.0.0 will be released before 6PM CEST (9AM PDT)? I don't want to be a joke
>> in front of the audience with my almost cutting edge version :-P
>>
>> tx
>>
>>
>> On Wed, Jun 15, 2016 at 3:59 PM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> Tomorrow lunchtime.
>>>
>>> Btw can you stop spamming every big data forum about good interview
>>> questions book for big data!
>>>
>>> I have seen your mails on this big data book in spark, hive and tez
>>> forums and I am sure there are many others. That seems to be the only mail
>>> you send around.
>>>
>>> This forum is for technical discussions not for promotional material.
>>> Please confine yourself to technical matters
>>>
>>>
>>>
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> *
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>>
>>> On 15 June 2016 at 12:45, Chaturvedi Chola 
>>> wrote:
>>>
 when is the spark 2.0 release planned

>>>
>>> --
>> andy
>>
>
> --
andy


Re: Spark 2.0 release date

2016-06-15 Thread Ted Yu
Andy:
You should sense the tone in Mich's response.

To my knowledge, there hasn't been an RC for the 2.0 release yet.
Once we have an RC, it goes through the normal voting process.

FYI

On Wed, Jun 15, 2016 at 7:38 AM, andy petrella 
wrote:

> > tomorrow lunch time
> Which TZ :-) → I'm working on the update of some materials that Dean
> Wampler and myself will give tomorrow at Scala Days
> (well tomorrow CEST).
>
> Hence, I'm upgrading the materials on spark 2.0.0-preview, do you think
> 2.0.0 will be released before 6PM CEST (9AM PDT)? I don't want to be a joke
> in front of the audience with my almost cutting edge version :-P
>
> tx
>
>
> On Wed, Jun 15, 2016 at 3:59 PM Mich Talebzadeh 
> wrote:
>
>> Tomorrow lunchtime.
>>
>> Btw can you stop spamming every big data forum about good interview
>> questions book for big data!
>>
>> I have seen your mails on this big data book in spark, hive and tez
>> forums and I am sure there are many others. That seems to be the only mail
>> you send around.
>>
>> This forum is for technical discussions not for promotional material.
>> Please confine yourself to technical matters
>>
>>
>>
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 15 June 2016 at 12:45, Chaturvedi Chola 
>> wrote:
>>
>>> when is the spark 2.0 release planned
>>>
>>
>> --
> andy
>


Re: Spark 2.0 release date

2016-06-15 Thread andy petrella
> tomorrow lunch time
Which TZ :-) → I'm working on the update of some materials that Dean
Wampler and myself will give tomorrow at Scala Days (well
tomorrow CEST).

Hence, I'm upgrading the materials on spark 2.0.0-preview, do you think
2.0.0 will be released before 6PM CEST (9AM PDT)? I don't want to be a joke
in front of the audience with my almost cutting edge version :-P

tx


On Wed, Jun 15, 2016 at 3:59 PM Mich Talebzadeh 
wrote:

> Tomorrow lunchtime.
>
> Btw can you stop spamming every big data forum about good interview
> questions book for big data!
>
> I have seen your mails on this big data book in spark, hive and tez forums
> and I am sure there are many others. That seems to be the only mail you
> send around.
>
> This forum is for technical discussions not for promotional material.
> Please confine yourself to technical matters
>
>
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 15 June 2016 at 12:45, Chaturvedi Chola 
> wrote:
>
>> when is the spark 2.0 release planned
>>
>
> --
andy


Get both feature importance and ROC curve from a random forest classifier

2016-06-15 Thread matd
Hi ml folks !

I'm using a Random Forest for a binary classification.
I'm interested in getting both the ROC *curve* and the feature importance
from the trained model.

If I'm not missing something obvious, the ROC curve is only available in the
old mllib world, via BinaryClassificationMetrics. In the new ml package,
only the areaUnderROC and areaUnderPR are available through
BinaryClassificationEvaluator.

The feature importance is only available in ml package, through
RandomForestClassificationModel.

Any idea how to get both?
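
One possible combination (a hedged sketch, assuming Spark 1.6 and DataFrames with "label" and "features" columns): train with the ml API to read featureImportances, then feed the positive-class probabilities into the old mllib BinaryClassificationMetrics for the full curve:

import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.mllib.linalg.Vector

val model = new RandomForestClassifier().fit(train)
println(model.featureImportances) // per-feature importance from the ml model

val scoreAndLabel = model.transform(test)
  .select("probability", "label")
  .map(row => (row.getAs[Vector]("probability")(1), row.getDouble(1)))
val roc = new BinaryClassificationMetrics(scoreAndLabel).roc() // RDD of (FPR, TPR) points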

Mathieu



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Get-both-feature-importance-and-ROC-curve-from-a-random-forest-classifier-tp27175.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: choice of RDD function

2016-06-15 Thread Jacek Laskowski
Hi,

Ad Q1, yes. See stateful operators like mapWithState and windows.
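
For the windowed averages, a minimal sketch (field names from the poster's JSON; events is an assumed DStream of parsed records):

// per-id average height over a 300-second window
val avgHt = events
  .map(e => (e.id, (e.ht.toDouble, 1L)))
  .reduceByKeyAndWindow((a: (Double, Long), b: (Double, Long)) => (a._1 + b._1, a._2 + b._2), Seconds(300))
  .mapValues { case (sum, n) => sum / n }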

Ad Q2, RDDs should be fine (and available out of the box), but I'd give
Datasets a try too since they're .toDF away.

Jacek
On 14 Jun 2016 10:29 p.m., "Sivakumaran S"  wrote:

Dear friends,

I have set up Kafka 0.9.0.0, Spark 1.6.1 and Scala 2.10. My source is
sending a json string periodically to a topic in kafka. I am able to
consume this topic using Spark Streaming and print it. The schema of the
source json is as follows:

{ “id”: 121156, “ht”: 42, “rotor_rpm”: 180, “temp”: 14.2, “time”:146593512}
{ “id”: 121157, “ht”: 18, “rotor_rpm”: 110, “temp”: 12.2, “time”: 146593512}
{ “id”: 121156, “ht”: 36, “rotor_rpm”: 160, “temp”: 14.4, “time”: 146593513}
{ “id”: 121157, “ht”: 19, “rotor_rpm”: 120, “temp”: 12.0, “time”: 146593513}
and so on.


In Spark streaming, I want to find the *average* of “ht” (height),
“rotor_rpm” and “temp” for each “id”. I also want to find the max and min
of the same fields in the time window (300 seconds in this case).

Q1. Can this be done using plain RDD and streaming functions or does it
require Dataframes/SQL? There may be more fields added to the json at a
later stage. There will be a lot of “id”s at a later stage.

Q2. If it can be done using either, which one would render to be more
efficient and fast?

As of now, the entire set up is in a single laptop.

Thanks in advance.

Regards,

Siva


Re: Is that normal spark performance?

2016-06-15 Thread Jörn Franke

What volume do you have? Why do you not use the corresponding Cassandra
functionality directly?
If you do it once, and not iteratively in memory, you cannot expect much
improvement.

> On 15 Jun 2016, at 16:01, nikita.dobryukha  wrote:
> 
> We use Cassandra 3.5 + Spark 1.6.1 in 2-node cluster (8 cores and 1g memory 
> per node). There is the following Cassandra table
> CREATE TABLE schema.trade (
> symbol text,
> date int,
> trade_time timestamp,
> reporting_venue text,
> trade_id bigint,
> ref_trade_id bigint,
> action_type text,
> price double,
> quantity int,
> condition_code text,
> PRIMARY KEY ((symbol, date), trade_time, trade_id)
> ) WITH compaction = {'class' : 'DateTieredCompactionStrategy', 
> 'base_time_seconds' : '3600', 'max_sstable_age_days' : '93'};
> And I want to calculate percentage of volume: sum of all volume from trades 
> in the relevant security during the time period groupped by exchange and time 
> bar (1 or 5 minutes). I've created an example:
> void getPercentageOfVolume(String symbol, Integer date, Timestamp timeFrom, 
> Timestamp timeTill, Integer barWidth) {
> char MILLISECOND_TO_MINUTE_MULTIPLIER = 60_000;
> LOG.info("start");
> JavaPairRDD<Tuple2<Timestamp, String>, Integer> counts =
> javaFunctions(sparkContext).cassandraTable("schema", "trade")
> .filter(row ->
> row.getString("symbol").equals(symbol) && 
> row.getInt("date").equals(date) &&
> row.getDateTime("trade_time").getMillis() >= 
> timeFrom.getTime() &&
> row.getDateTime("trade_time").getMillis() < 
> timeTill.getTime())
> .mapToPair(row ->
> new Tuple2<>(
> new Tuple2<Timestamp, String>(
> new Timestamp(
> 
> (row.getDateTime("trade_time").getMillis() / (barWidth * 
> MILLISECOND_TO_MINUTE_MULTIPLIER)) * barWidth * 
> MILLISECOND_TO_MINUTE_MULTIPLIER
> ),
> row.getString("reporting_venue")),
> row.getInt("quantity")
> )
> ).reduceByKey((a, b) -> a + b);
> LOG.info(counts.collect().toString());
> LOG.info("finish");
> }
> ...
> [2016-06-15 09:25:27.014] [INFO ] [main] [EquityTCAAnalytics] start
> [2016-06-15 09:25:28.000] [INFO ] [main] [NettyUtil] Found Netty's native 
> epoll transport in the classpath, using it
> [2016-06-15 09:25:28.518] [INFO ] [main] [Cluster] New Cassandra host 
> /node1:9042 added
> [2016-06-15 09:25:28.519] [INFO ] [main] [LocalNodeFirstLoadBalancingPolicy] 
> Added host node1 (datacenter1)
> [2016-06-15 09:25:28.519] [INFO ] [main] [Cluster] New Cassandra host 
> /node2:9042 added
> [2016-06-15 09:25:28.520] [INFO ] [main] [CassandraConnector] Connected to 
> Cassandra cluster: Cassandra
> [2016-06-15 09:25:29.115] [INFO ] [main] [SparkContext] Starting job: collect 
> at EquityTCAAnalytics.java:88
> [2016-06-15 09:25:29.385] [INFO ] [dag-scheduler-event-loop] [DAGScheduler] 
> Registering RDD 2 (mapToPair at EquityTCAAnalytics.java:78)
> [2016-06-15 09:25:29.388] [INFO ] [dag-scheduler-event-loop] [DAGScheduler] 
> Got job 0 (collect at EquityTCAAnalytics.java:88) with 5 output partitions
> [2016-06-15 09:25:29.388] [INFO ] [dag-scheduler-event-loop] [DAGScheduler] 
> Final stage: ResultStage 1 (collect at EquityTCAAnalytics.java:88)
> [2016-06-15 09:25:29.389] [INFO ] [dag-scheduler-event-loop] [DAGScheduler] 
> Parents of final stage: List(ShuffleMapStage 0)
> [2016-06-15 09:25:29.391] [INFO ] [dag-scheduler-event-loop] [DAGScheduler] 
> Missing parents: List(ShuffleMapStage 0)
> [2016-06-15 09:25:29.400] [INFO ] [dag-scheduler-event-loop] [DAGScheduler] 
> Submitting ShuffleMapStage 0 (MapPartitionsRDD[2] at mapToPair at 
> EquityTCAAnalytics.java:78), which has no missing parents
> [2016-06-15 09:25:29.594] [INFO ] [dag-scheduler-event-loop] [MemoryStore] 
> Block broadcast_0 stored as values in memory (estimated size 10.8 KB, free 
> 10.8 KB)
> [2016-06-15 09:25:29.642] [INFO ] [dag-scheduler-event-loop] [MemoryStore] 
> Block broadcast_0_piece0 stored as bytes in memory (estimated size 5.4 KB, 
> free 16.3 KB)
> [2016-06-15 09:25:29.647] [INFO ] [dispatcher-event-loop-7] 
> [BlockManagerInfo] Added broadcast_0_piece0 in memory on node2:44871 (size: 
> 5.4 KB, free: 2.4 GB)
> [2016-06-15 09:25:29.650] [INFO ] [dag-scheduler-event-loop] [SparkContext] 
> Created broadcast 0 from broadcast at DAGScheduler.scala:1006
> [2016-06-15 09:25:29.658] [INFO ] [dag-scheduler-event-loop] [DAGScheduler] 
> Submitting 5 missing tasks from ShuffleMapStage 0 (MapPartitionsRDD[2] at 
> mapToPair at EquityTCAAnalytics.java:78)
> [2016-06-15 09:25:29.661] [INFO ] [dag-scheduler-event-loop] 
> [TaskSchedulerImpl] Adding task set 0.0 with 5 tasks
> [2016-06-15 09:25:30.006] [INFO ] [dispatcher-event-loop-7] 
> [SparkDeploySchedulerBackend] Registered executor NettyRpcEndpointRef(null) 
> 
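
On the point about using the Cassandra functionality directly: the
spark-cassandra-connector can push these filters down to Cassandra with its
where method instead of scanning the whole table and filtering in Spark. A
minimal sketch against the schema above; symbol, date, timeFrom and timeTill
are the method parameters from the quoted code:

import com.datastax.spark.connector._

// symbol and date hit the partition key, trade_time the clustering column,
// so the filtering happens server-side in Cassandra
val trades = sc.cassandraTable("schema", "trade")
  .where("symbol = ? AND date = ? AND trade_time >= ? AND trade_time < ?",
    symbol, date, timeFrom, timeTill)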

Re: spark standalone High availability issues

2016-06-15 Thread Jacek Laskowski
Can you post the error?

Jacek
On 14 Jun 2016 10:56 p.m., "Darshan Singh"  wrote:

> Hi,
>
> I am using a standalone spark cluster and a zookeeper cluster for high
> availability. I sometimes get an error when I start the master. The
> error is related to leader election in Curator and says that no method was
> found (getProcess), and the master doesn't get started.
>
> Just wondering what could be causing the issue.
>
> I am using same zookeeper cluster for HDFS High availability and it is
> working just fine.
>
>
> Thanks
>


Re: can not show all data for this table

2016-06-15 Thread Mich Talebzadeh
at last some progress :)

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 15 June 2016 at 10:52, Lee Ho Yeung  wrote:

> Hi Mich,
>
> I found the cause of my problem now: I missed setting the delimiter, which is a tab,
>
> but it got an error,
>
> and I notice that only LibreOffice can open and read it well; even Excel
> on Windows cannot split the fields into the right format
>
> scala> val df =
> sqlContext.read.format("com.databricks.spark.csv").option("header",
> "true").option("inferSchema", "true").option("delimiter",
> "").load("/home/martin/result002.csv")
> java.lang.StringIndexOutOfBoundsException: String index out of range: 0
>
>
> On Wed, Jun 15, 2016 at 12:14 PM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> there may be an issue with data in your csv file. like blank header line
>> etc.
>>
>> sounds like you have an issue there. I normally get rid of blank lines
>> before putting csv file in hdfs.
>>
>> can you actually select from that temp table. like
>>
>> sql("select TransactionDate, TransactionType, Description, Value,
>> Balance, AccountName, AccountNumber from tmp").take(2)
>>
>> replace those with your column names. they are mapped using case class
>>
>>
>> HTH
>>
>>
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 15 June 2016 at 03:02, Lee Ho Yeung  wrote:
>>
>>> filter also has error
>>>
>>> 16/06/14 19:00:27 WARN Utils: Service 'SparkUI' could not bind on port
>>> 4040. Attempting port 4041.
>>> Spark context available as sc.
>>> SQL context available as sqlContext.
>>>
>>> scala> import org.apache.spark.sql.SQLContext
>>> import org.apache.spark.sql.SQLContext
>>>
>>> scala> val sqlContext = new SQLContext(sc)
>>> sqlContext: org.apache.spark.sql.SQLContext =
>>> org.apache.spark.sql.SQLContext@3114ea
>>>
>>> scala> val df =
>>> sqlContext.read.format("com.databricks.spark.csv").option("header",
>>> "true").option("inferSchema", "true").load("/home/martin/result002.csv")
>>> 16/06/14 19:00:32 WARN SizeEstimator: Failed to check whether
>>> UseCompressedOops is set; assuming yes
>>> Java HotSpot(TM) Client VM warning: You have loaded library
>>> /tmp/libnetty-transport-native-epoll7823347435914767500.so which might have
>>> disabled stack guard. The VM will try to fix the stack guard now.
>>> It's highly recommended that you fix the library with 'execstack -c
>>> <libfile>', or link it with '-z noexecstack'.
>>> df: org.apache.spark.sql.DataFrame = [a0a1a2a3a4
>>> a5a6a7a8a9: string]
>>>
>>> scala> df.printSchema()
>>> root
>>>  |-- a0a1a2a3a4a5a6a7a8a9:
>>> string (nullable = true)
>>>
>>>
>>> scala> df.registerTempTable("sales")
>>>
>>> scala> df.filter($"a0".contains("found
>>> deep=1")).filter($"a1".contains("found
>>> deep=1")).filter($"a2".contains("found deep=1"))
>>> org.apache.spark.sql.AnalysisException: cannot resolve 'a0' given input
>>> columns: [a0a1a2a3a4a5a6a7a8a9];
>>> at
>>> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>>>
>>>
>>>
>>>
>>>
>>> On Tue, Jun 14, 2016 at 6:19 PM, Lee Ho Yeung 
>>> wrote:
>>>
 after tried following commands, can not show data


 https://drive.google.com/file/d/0Bxs_ao6uuBDUVkJYVmNaUGx2ZUE/view?usp=sharing

 https://drive.google.com/file/d/0Bxs_ao6uuBDUc3ltMVZqNlBUYVk/view?usp=sharing

 /home/martin/Downloads/spark-1.6.1/bin/spark-shell --packages
 com.databricks:spark-csv_2.11:1.4.0

 import org.apache.spark.sql.SQLContext

 val sqlContext = new SQLContext(sc)
 val df =
 sqlContext.read.format("com.databricks.spark.csv").option("header",
 "true").option("inferSchema", "true").load("/home/martin/result002.csv")
 df.printSchema()
 df.registerTempTable("sales")
 val aggDF = sqlContext.sql("select * from sales where a0 like
 \"%deep=3%\"")
 df.collect.foreach(println)
 aggDF.collect.foreach(println)



 val df =
 sqlContext.read.format("com.databricks.spark.csv").option("header",
 "true").load("/home/martin/result002.csv")
 df.printSchema()
 df.registerTempTable("sales")
 sqlContext.sql("select * from sales").take(30).foreach(println)

>>>
>>>
>>
>


Is that normal spark performance?

2016-06-15 Thread nikita.dobryukha
We use Cassandra 3.5 + Spark 1.6.1 in a 2-node cluster (8 cores and 1g memory
per node). There is the following Cassandra table (quoted in full in the reply
above), and I want to calculate the percentage of volume: the sum of all volume
from trades in the relevant security during the time period, grouped by
exchange and time bar (1 or 5 minutes). I've created an example (also quoted
above). Is 32.5 s normal?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Is-that-normal-spark-performance-tp27174.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Spark 2.0 release date

2016-06-15 Thread Mich Talebzadeh
Tomorrow lunchtime.

Btw, can you stop spamming every big data forum about a "good interview
questions" book for big data!

I have seen your mails about this big data book in the spark, hive and tez
forums, and I am sure there are many others. That seems to be the only mail
you send around.

This forum is for technical discussions, not for promotional material.
Please confine yourself to technical matters.





Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 15 June 2016 at 12:45, Chaturvedi Chola 
wrote:

> when is the spark 2.0 release planned?
>


RE: Handle empty kafka in Spark Streaming

2016-06-15 Thread David Newberger
Hi Yogesh,

I'm not sure if this is possible or not. I'd be interested in knowing. My gut
says it would be an anti-pattern if it were possible to do something like this,
and that's why I handle it in either the foreachRDD or the foreachPartition.
The way I look at Spark Streaming is as an application which is always running
and doing something like windowed batching or micro-batching or whatever I'm
trying to accomplish. If the RDD I get from Kafka is empty, then I don't run
the rest of the job. If the RDD I get from Kafka has some number of events,
then I'll process the RDD further.

David Newberger

-Original Message-
From: Yogesh Vyas [mailto:informy...@gmail.com] 
Sent: Wednesday, June 15, 2016 8:30 AM
To: David Newberger
Subject: Re: Handle empty kafka in Spark Streaming

I am looking for something which checks the JavaPairReceiverInputDStream
before going further with any operations.
For example, if I get the JavaPairReceiverInputDStream in the following
manner:

JavaPairReceiverInputDStream<String, String> message =
KafkaUtils.createStream(ssc, zkQuorum, group, topics,
StorageLevel.MEMORY_AND_DISK_SER());

Then I would like to check whether the message is empty or not. If it is not
empty, then go on to further operations; else wait for some data in Kafka.

On Wed, Jun 15, 2016 at 6:31 PM, David Newberger  
wrote:
> If you're asking how to handle no messages in a batch window then I would add 
> an isEmpty check like:
>
> dStream.foreachRDD(rdd => {
> if (!rdd.isEmpty())
> ...
> }
>
> Or something like that.
>
>
> David Newberger
>
> -Original Message-
> From: Yogesh Vyas [mailto:informy...@gmail.com]
> Sent: Wednesday, June 15, 2016 6:31 AM
> To: user
> Subject: Handle empty kafka in Spark Streaming
>
> Hi,
>
> Does anyone know how to handle an empty Kafka topic while a Spark Streaming
> job is running?
>
> Regards,
> Yogesh
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For 
> additional commands, e-mail: user-h...@spark.apache.org
>
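
A slightly fuller sketch of the guard described in this thread, mirroring the
createStream call from the question; ssc, zkQuorum, group and the topic map are
assumed to exist, and the count is only a placeholder for the real processing:

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.kafka.KafkaUtils

val messages = KafkaUtils.createStream(ssc, zkQuorum, group, topics,
  StorageLevel.MEMORY_AND_DISK_SER)

messages.foreachRDD { rdd =>
  // skip the batch entirely when Kafka delivered nothing in this window
  if (!rdd.isEmpty()) {
    rdd.values.count()
  }
}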


Re: Limit pyspark.daemon threads

2016-06-15 Thread Gene Pang
As Sven mentioned, you can use Alluxio to store RDDs in off-heap memory,
and you can then share that RDD across different jobs. If you would like to
run Spark on Alluxio, this documentation can help:
http://www.alluxio.org/documentation/master/en/Running-Spark-on-Alluxio.html

Thanks,
Gene

On Tue, Jun 14, 2016 at 12:44 AM, agateaaa  wrote:

> Hi,
>
> I am seeing this issue too with pyspark (Using Spark 1.6.1).  I have set
> spark.executor.cores to 1, but I see that whenever streaming batch starts
> processing data, the python -m pyspark.daemon processes increase gradually
> to about 5, (increasing CPU% on a box about 4-5 times, each pyspark.daemon
> takes up around 100 % CPU)
>
> After the processing is done 4 pyspark.daemon processes go away and we are
> left with one till the next batch run. Also, sometimes the CPU usage for the
> executor process spikes to about 800% even though spark.executor.cores is
> set to 1.
>
> e.g. top output
> PID USER  PR   NI  VIRT  RES  SHR S   %CPU %MEMTIME+  COMMAND
> 19634 spark 20   0 8871420 1.790g  32056 S 814.1  2.9   0:39.33
> /usr/lib/j+ <--EXECUTOR
>
> 13897 spark 20   0   46576  17916   6720 S   100.0  0.0   0:00.17
> python -m + <--pyspark.daemon
> 13991 spark 20   0   46524  15572   4124 S   98.0  0.0   0:08.18
> python -m + <--pyspark.daemon
> 14488 spark 20   0   46524  15636   4188 S   98.0  0.0   0:07.25
> python -m + <--pyspark.daemon
> 14514 spark 20   0   46524  15636   4188 S   94.0  0.0   0:06.72
> python -m + <--pyspark.daemon
> 14526 spark 20   0   48200  17172   4092 S   0.0  0.0   0:00.38 python
> -m + <--pyspark.daemon
>
>
>
> Is there any way to control the number of pyspark.daemon processes that
> get spawned ?
>
> Thank you
> Agateaaa
>
> On Sun, Mar 27, 2016 at 1:08 AM, Sven Krasser  wrote:
>
>> Hey Ken,
>>
>> 1. You're correct, cached RDDs live on the JVM heap. (There's an off-heap
>> storage option using Alluxio, formerly Tachyon, with which I have no
>> experience however.)
>>
>> 2. The worker memory setting is not a hard maximum unfortunately. What
>> happens is that during aggregation the Python daemon will check its process
>> size. If the size is larger than this setting, it will start spilling to
>> disk. I've seen many occasions where my daemons grew larger. Also, you're
>> relying on Python's memory management to free up space again once objects
>> are evicted. In practice, leave this setting reasonably small but make sure
>> there's enough free memory on the machine so you don't run into OOM
>> conditions. If the lower memory setting causes strains for your users, make
>> sure they increase the parallelism of their jobs (smaller partitions
>> meaning less data is processed at a time).
>>
>> 3. I believe that is the behavior you can expect when setting
>> spark.executor.cores. I've not experimented much with it and haven't looked
>> at that part of the code, but what you describe also reflects my
>> understanding. Please share your findings here, I'm sure those will be very
>> helpful to others, too.
>>
>> One more suggestion for your users is to move to the Pyspark DataFrame
>> API. Much of the processing will then happen in the JVM, and you will bump
>> into fewer Python resource contention issues.
>>
>> Best,
>> -Sven
>>
>>
>> On Sat, Mar 26, 2016 at 1:38 PM, Carlile, Ken 
>> wrote:
>>
>>> This is extremely helpful!
>>>
>>> I’ll have to talk to my users about how the python memory limit should
>>> be adjusted and what their expectations are. I’m fairly certain we bumped
>>> it up in the dark past when jobs were failing because of insufficient
>>> memory for the python processes.
>>>
>>> So just to make sure I’m understanding correctly:
>>>
>>>
>>>    - JVM memory (set by SPARK_EXECUTOR_MEMORY and/or
>>>    SPARK_WORKER_MEMORY?) is where the RDDs are stored. Currently both of
>>>    those values are set to 90GB.
>>>    - spark.python.worker.memory controls how much RAM each python task
>>>    can take at maximum (roughly speaking). Currently set to 4GB.
>>>    - spark.executor.cores controls how many java worker threads will exist
>>>    and thus indirectly how many pyspark daemon processes will exist.
>>>
>>>
>>> I’m also looking into fixing my cron jobs so they don’t stack up by
>>> implementing flock in the jobs and changing how teardowns of the spark
>>> cluster work as far as failed workers.
>>>
>>> Thanks again,
>>> —Ken
>>>
>>> On Mar 26, 2016, at 4:08 PM, Sven Krasser  wrote:
>>>
>>> My understanding is that the spark.executor.cores setting controls the
>>> number of worker threads in the executor in the JVM. Each worker thread
>>> communicates then with a pyspark daemon process (these are not threads) to
>>> stream data into Python. There should be one daemon process per worker
>>> thread (but as I mentioned I sometimes see a low multiple).
>>>
>>> Your 4GB limit for Python is fairly high, that means even for 12 
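
For reference, a sketch of the two knobs discussed in this thread; the values
are illustrative, not recommendations:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.executor.cores", "1")           // JVM worker threads per executor
  .set("spark.python.worker.memory", "512m")  // spill threshold per Python worker, not a hard cap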

Re: Spark 2.0 Preview After caching query didn't work and can't kill job.

2016-06-15 Thread Chanh Le
Hi Gene,
I am using Alluxio 1.1.0 with the Spark 2.0 Preview version.
I load from Alluxio, then cache and query; on the 2nd query Spark gets stuck.



> On Jun 15, 2016, at 8:42 PM, Gene Pang  wrote:
> 
> Hi,
> 
> Which version of Alluxio are you using?
> 
> Thanks,
> Gene
> 
> On Tue, Jun 14, 2016 at 3:45 AM, Chanh Le  wrote:
> I am testing Spark 2.0.
> I load data from alluxio and cache it, then I query; the first query is ok 
> because it kicks off the cache action. But after that I run the query again and 
> it’s stuck.
> I ran on a 5-node cluster in spark-shell.
> 
> Did anyone have this issue?
> 
> 
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark 2.0 Preview After caching query didn't work and can't kill job.

2016-06-15 Thread Gene Pang
Hi,

Which version of Alluxio are you using?

Thanks,
Gene

On Tue, Jun 14, 2016 at 3:45 AM, Chanh Le  wrote:

> I am testing Spark 2.0.
> I load data from alluxio and cache it, then I query; the first query is ok
> because it kicks off the cache action. But after that I run the query again and
> it’s stuck.
> I ran on a 5-node cluster in spark-shell.
>
> Did anyone have this issue?
>
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Subtract two DStreams

2016-06-15 Thread Matthias Niehoff
Hi,

I want to subtract 2 DStreams (based on the same input stream) to get all
elements that exist in the original stream, but not in the modified stream
(the modified stream is changed using joinWithCassandraTable, which does an
inner join and because of this might remove entries).

Subtract is only possible on RDDs. So I could use a foreachRDD right at the
beginning of the stream processing and work on RDDs. I think it's quite ugly
to use the output op at the beginning and then implement a lot of
transformations in the foreachRDD. So could you think of different ways to
do an efficient diff between two DStreams?

Thank you

-- 
Matthias Niehoff | IT-Consultant | Agile Software Factory  | Consulting
codecentric AG | Zeppelinstr 2 | 76185 Karlsruhe | Deutschland
tel: +49 (0) 721.9595-681 | fax: +49 (0) 721.9595-666 | mobil: +49 (0)
172.1702676
www.codecentric.de | blog.codecentric.de | www.meettheexperts.de |
www.more4fi.de

Registered office: Solingen | HRB 25917 | Amtsgericht Wuppertal
Board: Michael Hochgürtel . Mirko Novakovic . Rainer Vehns
Supervisory board: Patric Fedlmeier (chairman) . Klaus Jäger . Jürgen Schütz

This e-mail, including any attached files, contains confidential and/or
legally protected information. If you are not the intended recipient or have
received this e-mail in error, please inform the sender immediately and
delete this e-mail and any attached files. Unauthorized copying, use or
opening of attached files, as well as unauthorized forwarding of this
e-mail, is not permitted.
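
One way to express the diff without moving the whole pipeline into a foreachRDD
is DStream.transformWith, which lines the two streams up batch by batch and
lets you call subtract on the paired RDDs. A minimal sketch, with the element
type assumed to be String:

import org.apache.spark.rdd.RDD

// assumed: original and modified are DStream[String] derived from the same input
val removed = original.transformWith(modified,
  (orig: RDD[String], mod: RDD[String]) => orig.subtract(mod))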


RE: Handle empty kafka in Spark Streaming

2016-06-15 Thread David Newberger
If you're asking how to handle no messages in a batch window then I would add 
an isEmpty check like:

dStream.foreachRDD(rdd => {
if (!rdd.isEmpty()) 
...
}

Or something like that. 


David Newberger

-Original Message-
From: Yogesh Vyas [mailto:informy...@gmail.com] 
Sent: Wednesday, June 15, 2016 6:31 AM
To: user
Subject: Handle empty kafka in Spark Streaming

Hi,

Does anyone know how to handle an empty Kafka topic while a Spark Streaming
job is running?

Regards,
Yogesh

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional 
commands, e-mail: user-h...@spark.apache.org



RE: streaming example has error

2016-06-15 Thread David Newberger
Have you tried to “set spark.driver.allowMultipleContexts = true”?

David Newberger

From: Lee Ho Yeung [mailto:jobmatt...@gmail.com]
Sent: Tuesday, June 14, 2016 8:34 PM
To: user@spark.apache.org
Subject: streaming example has error

When I simulated streaming with nc -lk 
I got the error below,
then I tried the example:

martin@ubuntu:~/Downloads$ /home/martin/Downloads/spark-1.6.1/bin/run-example 
streaming.NetworkWordCount localhost 
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
16/06/14 18:33:06 INFO StreamingExamples: Setting log level to [WARN] for 
streaming example. To override add a custom log4j.properties to the classpath.
16/06/14 18:33:06 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable
16/06/14 18:33:06 WARN Utils: Your hostname, ubuntu resolves to a loopback 
address: 127.0.1.1; using 192.168.157.134 instead (on interface eth0)
16/06/14 18:33:06 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another 
address
16/06/14 18:33:13 WARN SizeEstimator: Failed to check whether UseCompressedOops 
is set; assuming yes

and got an error too.

import org.apache.spark._
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._
val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
val ssc = new StreamingContext(conf, Seconds(1))
val lines = ssc.socketTextStream("localhost", )
val words = lines.flatMap(_.split(" "))
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)
ssc.start()
ssc.awaitTermination()



scala> import org.apache.spark._
import org.apache.spark._

scala> import org.apache.spark.streaming._
import org.apache.spark.streaming._

scala> import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.streaming.StreamingContext._

scala> val conf = new 
SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
conf: org.apache.spark.SparkConf = 
org.apache.spark.SparkConf@67bcaf

scala> val ssc = new StreamingContext(conf, Seconds(1))
16/06/14 18:28:44 WARN AbstractLifeCycle: FAILED 
SelectChannelConnector@0.0.0.0:4040:
 java.net.BindException: Address already in use
java.net.BindException: Address already in use
at sun.nio.ch.Net.bind0(Native Method)
at sun.nio.ch.Net.bind(Net.java:433)
at sun.nio.ch.Net.bind(Net.java:425)
at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:223)
at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
at 
org.eclipse.jetty.server.nio.SelectChannelConnector.open(SelectChannelConnector.java:187)
at 
org.eclipse.jetty.server.AbstractConnector.doStart(AbstractConnector.java:316)
at 
org.eclipse.jetty.server.nio.SelectChannelConnector.doStart(SelectChannelConnector.java:265)
at 
org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:64)
at org.eclipse.jetty.server.Server.doStart(Server.java:293)
at 
org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:64)
at 
org.apache.spark.ui.JettyUtils$.org$apache$spark$ui$JettyUtils$$connect$1(JettyUtils.scala:252)
at org.apache.spark.ui.JettyUtils$$anonfun$5.apply(JettyUtils.scala:262)
at org.apache.spark.ui.JettyUtils$$anonfun$5.apply(JettyUtils.scala:262)
at 
org.apache.spark.util.Utils$$anonfun$startServiceOnPort$1.apply$mcVI$sp(Utils.scala:1988)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
at org.apache.spark.util.Utils$.startServiceOnPort(Utils.scala:1979)
at org.apache.spark.ui.JettyUtils$.startJettyServer(JettyUtils.scala:262)
at org.apache.spark.ui.WebUI.bind(WebUI.scala:136)
at org.apache.spark.SparkContext$$anonfun$13.apply(SparkContext.scala:481)
at org.apache.spark.SparkContext$$anonfun$13.apply(SparkContext.scala:481)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:481)
at 
org.apache.spark.streaming.StreamingContext$.createNewSparkContext(StreamingContext.scala:874)
at 
org.apache.spark.streaming.StreamingContext.<init>(StreamingContext.scala:81)
at 
$line37.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:36)
at 
$line37.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:41)
at 
$line37.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:43)
at 
$line37.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:45)
at 
$line37.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:47)
at 
$line37.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:49)
at 
$line37.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:51)
at $line37.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:53)
at $line37.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:55)
at 
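
The bind failure above comes from constructing a second SparkContext inside
spark-shell, which already owns one (and its UI on port 4040). A minimal sketch
of the usual fix, reusing the shell's sc; the port is a placeholder since the
original was elided:

import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(1))       // reuse the shell's SparkContext
val lines = ssc.socketTextStream("localhost", 9999)  // placeholder port
val wordCounts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
wordCounts.print()
ssc.start()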

Spark 2.0 release date

2016-06-15 Thread Chaturvedi Chola
when is the spark 2.0 release planned?


Handle empty kafka in Spark Streaming

2016-06-15 Thread Yogesh Vyas
Hi,

Does anyone know how to handle an empty Kafka topic while a Spark Streaming
job is running?

Regards,
Yogesh

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Adding h5 files in a zip to use with PySpark

2016-06-15 Thread ar7
I am using PySpark 1.6.1 for my spark application. I have additional modules
which I am loading using the argument --py-files. I also have an h5 file
which I need to access from one of the modules for initializing the
ApolloNet.

Is there any way I could access those files from the modules if I put them
in the same archive? I tried this approach, but it was throwing an error
because the files are not there on every worker. I can think of one solution,
which is copying the file to each of the workers, but I want to know if there
are better ways to do it.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Adding-h5-files-in-a-zip-to-use-with-PySpark-tp27173.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
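
One standard alternative to packing the h5 file into the --py-files archive is
SparkContext.addFile, which ships the file to every worker once and resolves a
local path through SparkFiles.get; the same pair of calls exists in pyspark
(sc.addFile / pyspark.SparkFiles.get). A sketch with an illustrative path:

import org.apache.spark.SparkFiles

sc.addFile("/path/to/model.h5")             // distributed once to each worker
val localPath = SparkFiles.get("model.h5")  // absolute local path on that worker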



can spark help to prevent memory error for itertools.combinations(initlist, 2) in python script

2016-06-15 Thread Lee Ho Yeung
I wrote a python script which uses itertools.combinations(initlist, 2),

but it got a memory error when the number of elements in initlist went over 14,000.

Is it possible to use Spark to do this work?

I have seen that yatel can do this; do Spark and yatel use the hard disk as memory?

If so,

what needs to change in the python code?
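
A sketch of how the pair generation could be distributed in Spark instead of
being materialized in a single Python process; initList is assumed to be the
driver-side list, and the i < j filter reproduces
itertools.combinations(initlist, 2):

val items = sc.parallelize(initList).zipWithIndex()
val pairs = items.cartesian(items)
  .filter { case ((_, i), (_, j)) => i < j }  // each unordered pair exactly once
  .map { case ((a, _), (b, _)) => (a, b) }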


Re: can not show all data for this table

2016-06-15 Thread Lee Ho Yeung
Hi Mich,

I found the cause of my problem now: I missed setting the delimiter, which is
a tab,

but it got an error,

and I notice that only LibreOffice can open and read it well; even Excel
on Windows cannot split the fields into the right format

scala> val df =
sqlContext.read.format("com.databricks.spark.csv").option("header",
"true").option("inferSchema", "true").option("delimiter",
"").load("/home/martin/result002.csv")
java.lang.StringIndexOutOfBoundsException: String index out of range: 0
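
For reference, a sketch of the corrected read, assuming the file really is
tab-separated: spark-csv accepts the escaped form "\t" for the delimiter
option, whereas an empty delimiter string throws exactly the
StringIndexOutOfBoundsException shown above.

val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .option("delimiter", "\t")  // escaped tab
  .load("/home/martin/result002.csv")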


On Wed, Jun 15, 2016 at 12:14 PM, Mich Talebzadeh  wrote:

> there may be an issue with data in your csv file. like blank header line
> etc.
>
> sounds like you have an issue there. I normally get rid of blank lines
> before putting csv file in hdfs.
>
> can you actually select from that temp table. like
>
> sql("select TransactionDate, TransactionType, Description, Value, Balance,
> AccountName, AccountNumber from tmp").take(2)
>
> replace those with your column names. they are mapped using case class
>
>
> HTH
>
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 15 June 2016 at 03:02, Lee Ho Yeung  wrote:
>
>> filter also has error
>>
>> 16/06/14 19:00:27 WARN Utils: Service 'SparkUI' could not bind on port
>> 4040. Attempting port 4041.
>> Spark context available as sc.
>> SQL context available as sqlContext.
>>
>> scala> import org.apache.spark.sql.SQLContext
>> import org.apache.spark.sql.SQLContext
>>
>> scala> val sqlContext = new SQLContext(sc)
>> sqlContext: org.apache.spark.sql.SQLContext =
>> org.apache.spark.sql.SQLContext@3114ea
>>
>> scala> val df =
>> sqlContext.read.format("com.databricks.spark.csv").option("header",
>> "true").option("inferSchema", "true").load("/home/martin/result002.csv")
>> 16/06/14 19:00:32 WARN SizeEstimator: Failed to check whether
>> UseCompressedOops is set; assuming yes
>> Java HotSpot(TM) Client VM warning: You have loaded library
>> /tmp/libnetty-transport-native-epoll7823347435914767500.so which might have
>> disabled stack guard. The VM will try to fix the stack guard now.
>> It's highly recommended that you fix the library with 'execstack -c
>> <libfile>', or link it with '-z noexecstack'.
>> df: org.apache.spark.sql.DataFrame = [a0a1a2a3a4a5
>> a6a7a8a9: string]
>>
>> scala> df.printSchema()
>> root
>>  |-- a0a1a2a3a4a5a6a7a8a9: string
>> (nullable = true)
>>
>>
>> scala> df.registerTempTable("sales")
>>
>> scala> df.filter($"a0".contains("found
>> deep=1")).filter($"a1".contains("found
>> deep=1")).filter($"a2".contains("found deep=1"))
>> org.apache.spark.sql.AnalysisException: cannot resolve 'a0' given input
>> columns: [a0a1a2a3a4a5a6a7a8a9];
>> at
>> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>>
>>
>>
>>
>>
>> On Tue, Jun 14, 2016 at 6:19 PM, Lee Ho Yeung 
>> wrote:
>>
>>> after tried following commands, can not show data
>>>
>>>
>>> https://drive.google.com/file/d/0Bxs_ao6uuBDUVkJYVmNaUGx2ZUE/view?usp=sharing
>>>
>>> https://drive.google.com/file/d/0Bxs_ao6uuBDUc3ltMVZqNlBUYVk/view?usp=sharing
>>>
>>> /home/martin/Downloads/spark-1.6.1/bin/spark-shell --packages
>>> com.databricks:spark-csv_2.11:1.4.0
>>>
>>> import org.apache.spark.sql.SQLContext
>>>
>>> val sqlContext = new SQLContext(sc)
>>> val df =
>>> sqlContext.read.format("com.databricks.spark.csv").option("header",
>>> "true").option("inferSchema", "true").load("/home/martin/result002.csv")
>>> df.printSchema()
>>> df.registerTempTable("sales")
>>> val aggDF = sqlContext.sql("select * from sales where a0 like
>>> \"%deep=3%\"")
>>> df.collect.foreach(println)
>>> aggDF.collect.foreach(println)
>>>
>>>
>>>
>>> val df =
>>> sqlContext.read.format("com.databricks.spark.csv").option("header",
>>> "true").load("/home/martin/result002.csv")
>>> df.printSchema()
>>> df.registerTempTable("sales")
>>> sqlContext.sql("select * from sales").take(30).foreach(println)
>>>
>>
>>
>


Spark SQL NoSuchMethodException...DriverWrapper.<init>()

2016-06-15 Thread Mirko
Hi All,

I’m using Spark 1.6.1 and I’m getting the error below. It also appears with
the current branch-1.6.
The code that is generating the error loads a table from an MS SQL Server:

sqlContext.read.format("jdbc").options(options).load()

I’ve also checked whether the Microsoft JDBC driver is loaded correctly, and
it is (I’m using an uber jar with all the dependencies inside).
With version 1.6.0 and the same source the error does not appear.
I would really appreciate any help or suggestion.

Many thanks,
Mirko



Exception in thread "main" java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at
org.apache.spark.deploy.worker.DriverWrapper$.main(DriverWrapper.scala:58)
at 
org.apache.spark.deploy.worker.DriverWrapper.main(DriverWrapper.scala)
Caused by: java.lang.InstantiationException:
org.apache.spark.sql.execution.datasources.jdbc.DriverWrapper
at java.lang.Class.newInstance(Class.java:427)
at
org.apache.spark.sql.execution.datasources.jdbc.DriverRegistry$.register(DriverRegistry.scala:46)
at
org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$createConnectionFactory$2.apply(JdbcUtils.scala:53)
at
org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$createConnectionFactory$2.apply(JdbcUtils.scala:52)
at
org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:120)
at
org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation.<init>(JDBCRelation.scala:91)
at
org.apache.spark.sql.execution.datasources.jdbc.DefaultSource.createRelation(DefaultSource.scala:57)
at
org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:158)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:119)
at
com.ixxus.analytic.batch.AlfrescoToRDD.loadDataFrameAsRdd(AlfrescoToRDD.scala:250)
at 
com.ixxus.analytic.batch.AlfrescoToRDD.loadTags(AlfrescoToRDD.scala:189)
at
com.ixxus.analytic.phase.extraction.ExtractionJob$.execute(ExtractionJob.scala:29)
at com.ixxus.analytic.program.AllProgram$.main(AllProgram.scala:36)
at com.ixxus.analytic.program.AllProgram.main(AllProgram.scala)
... 6 more
Caused by: java.lang.NoSuchMethodException:
org.apache.spark.sql.execution.datasources.jdbc.DriverWrapper.<init>()
at java.lang.Class.getConstructor0(Class.java:3082)
at java.lang.Class.newInstance(Class.java:412)
... 19 more



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-NoSuchMethodException-DriverWrapper-init-tp27171.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



how do I set TBLPROPERTIES in dataFrame.saveAsTable()?

2016-06-15 Thread Yang
I tried df.write.options(Map(prop_name -> prop_value)).saveAsTable(tb_name),

but it doesn't seem to work.

thanks a lot!
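
There is no direct TBLPROPERTIES hook on saveAsTable in 1.6 that I know of; a
workaround sketch, assuming sqlContext is a HiveContext and using the
illustrative names from the question, is to set the properties afterwards via
HiveQL:

df.write.saveAsTable("tb_name")
sqlContext.sql("ALTER TABLE tb_name SET TBLPROPERTIES ('prop_name' = 'prop_value')")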


Re: can not show all data for this table

2016-06-15 Thread Lee Ho Yeung
Hi Mich,

https://drive.google.com/file/d/0Bxs_ao6uuBDUQ2NfYnhvUl9EZXM/view?usp=sharing
https://drive.google.com/file/d/0Bxs_ao6uuBDUS1UzTWd1Q2VJdEk/view?usp=sharing

This time I made sure the headers cover all the data; only some columns
which have headers do not have data.

But it still cannot show all the data the way LibreOffice does when I open it.

/home/martin/Downloads/spark-1.6.1/bin/spark-shell --packages
com.databricks:spark-csv_2.11:1.4.0
import org.apache.spark.sql.SQLContext
val sqlContext = new SQLContext(sc)
val df =
sqlContext.read.format("com.databricks.spark.csv").option("header",
"true").option("inferSchema", "true").load("/home/martin/result002.csv")
df.printSchema()
df.registerTempTable("sales")
df.filter($"a3".contains("found deep=1"))





On Tue, Jun 14, 2016 at 9:14 PM, Mich Talebzadeh 
wrote:

> there may be an issue with data in your csv file. like blank header line
> etc.
>
> sounds like you have an issue there. I normally get rid of blank lines
> before putting csv file in hdfs.
>
> can you actually select from that temp table. like
>
> sql("select TransactionDate, TransactionType, Description, Value, Balance,
> AccountName, AccountNumber from tmp").take(2)
>
> replace those with your column names. they are mapped using case class
>
>
> HTH
>
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 15 June 2016 at 03:02, Lee Ho Yeung  wrote:
>
>> filter also has error
>>
>> 16/06/14 19:00:27 WARN Utils: Service 'SparkUI' could not bind on port
>> 4040. Attempting port 4041.
>> Spark context available as sc.
>> SQL context available as sqlContext.
>>
>> scala> import org.apache.spark.sql.SQLContext
>> import org.apache.spark.sql.SQLContext
>>
>> scala> val sqlContext = new SQLContext(sc)
>> sqlContext: org.apache.spark.sql.SQLContext =
>> org.apache.spark.sql.SQLContext@3114ea
>>
>> scala> val df =
>> sqlContext.read.format("com.databricks.spark.csv").option("header",
>> "true").option("inferSchema", "true").load("/home/martin/result002.csv")
>> 16/06/14 19:00:32 WARN SizeEstimator: Failed to check whether
>> UseCompressedOops is set; assuming yes
>> Java HotSpot(TM) Client VM warning: You have loaded library
>> /tmp/libnetty-transport-native-epoll7823347435914767500.so which might have
>> disabled stack guard. The VM will try to fix the stack guard now.
>> It's highly recommended that you fix the library with 'execstack -c
>> <libfile>', or link it with '-z noexecstack'.
>> df: org.apache.spark.sql.DataFrame = [a0a1a2a3a4a5
>> a6a7a8a9: string]
>>
>> scala> df.printSchema()
>> root
>>  |-- a0a1a2a3a4a5a6a7a8a9: string
>> (nullable = true)
>>
>>
>> scala> df.registerTempTable("sales")
>>
>> scala> df.filter($"a0".contains("found
>> deep=1")).filter($"a1".contains("found
>> deep=1")).filter($"a2".contains("found deep=1"))
>> org.apache.spark.sql.AnalysisException: cannot resolve 'a0' given input
>> columns: [a0a1a2a3a4a5a6a7a8a9];
>> at
>> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>>
>>
>>
>>
>>
>> On Tue, Jun 14, 2016 at 6:19 PM, Lee Ho Yeung 
>> wrote:
>>
>>> after tried following commands, can not show data
>>>
>>>
>>> https://drive.google.com/file/d/0Bxs_ao6uuBDUVkJYVmNaUGx2ZUE/view?usp=sharing
>>>
>>> https://drive.google.com/file/d/0Bxs_ao6uuBDUc3ltMVZqNlBUYVk/view?usp=sharing
>>>
>>> /home/martin/Downloads/spark-1.6.1/bin/spark-shell --packages
>>> com.databricks:spark-csv_2.11:1.4.0
>>>
>>> import org.apache.spark.sql.SQLContext
>>>
>>> val sqlContext = new SQLContext(sc)
>>> val df =
>>> sqlContext.read.format("com.databricks.spark.csv").option("header",
>>> "true").option("inferSchema", "true").load("/home/martin/result002.csv")
>>> df.printSchema()
>>> df.registerTempTable("sales")
>>> val aggDF = sqlContext.sql("select * from sales where a0 like
>>> \"%deep=3%\"")
>>> df.collect.foreach(println)
>>> aggDF.collect.foreach(println)
>>>
>>>
>>>
>>> val df =
>>> sqlContext.read.format("com.databricks.spark.csv").option("header",
>>> "true").load("/home/martin/result002.csv")
>>> df.printSchema()
>>> df.registerTempTable("sales")
>>> sqlContext.sql("select * from sales").take(30).foreach(println)
>>>
>>
>>
>