Re: JDBC Streams

2015-07-05 Thread Akhil Das
If you want a long-running application, then go with Spark Streaming (which kind of blocks your resources). On the other hand, if you use the job server then you can actually use the resources (CPUs) for other jobs also when your DB job is not using them. Thanks Best Regards On Sun, Jul 5, 2015 at

Why Kryo Serializer is slower than Java Serializer in TeraSort

2015-07-05 Thread Gavin Liu
Hi, I am using the TeraSort benchmark from ehiggs's branch (https://github.com/ehiggs/spark-terasort). Then I noticed that in TeraSort.scala, it is using the Kryo serializer. So I made a small change from org.apache.spark.serializer.KryoSerializer to

Re: Why Kryo Serializer is slower than Java Serializer in TeraSort

2015-07-05 Thread Akhil Das
Looks like it spent more time writing/transferring the 40GB of shuffle when you used Kryo. And surprisingly, JavaSerializer has 700MB of shuffle? Thanks Best Regards On Sun, Jul 5, 2015 at 12:01 PM, Gavin Liu ilovesonsofanar...@gmail.com wrote: Hi, I am using TeraSort benchmark from

Futures timed out after 10000 milliseconds

2015-07-05 Thread SamRoberts
I have a very simple application, where I am initializing the Spark context and using the context. The problem happens with both Spark 1.3.1 and 1.4.0; Scala 2.10.4; Java 1.7.0_79. Full program: import org.apache.spark.SparkContext import org.apache.spark.SparkContext._ object SimpleApp {
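The program body is cut off above; a minimal self-contained reconstruction of that kind of program (the RDD work here is illustrative, not the poster's actual code; per the report, the failure occurs while the context is being initialized):

    import org.apache.spark.{SparkConf, SparkContext}

    object SimpleApp {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("SimpleApp").setMaster("local[*]")
        val sc = new SparkContext(conf)   // the reported timeout fires around here
        println(sc.parallelize(1 to 1000).sum())
        sc.stop()
      }
    }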

RE: JDBC Streams

2015-07-05 Thread Ashic Mahtab
Hi Ayan, How continuous is your workload? As Akhil points out, with streaming, you'll give up at least one core for receiving, and will need at most one more core for processing. Unless you're running on something like Mesos, this means that those cores are dedicated to your app, and can't be

Re: JDBC Streams

2015-07-05 Thread ayan guha
Thanks Akhil. In case I go with Spark Streaming, I guess I have to implement a custom receiver, and Spark Streaming will call this receiver every batch interval, is that correct? Any gotchas you see in this plan? TIA... Best, Ayan On Sun, Jul 5, 2015 at 5:40 PM, Akhil Das ak...@sigmoidanalytics.com
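For reference, a custom receiver is not called once per batch interval; it runs continuously and hands records to Spark via store(). A minimal sketch against the Spark 1.x Receiver API (the class name and the pollOnce() helper are illustrative; the actual JDBC query is left as a placeholder):

    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.receiver.Receiver

    class JdbcPollingReceiver(url: String)
      extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

      def onStart(): Unit = {
        // Receivers run for the life of the stream; poll in a background thread
        // rather than expecting Spark to invoke this once per batch.
        new Thread("jdbc-poller") {
          override def run(): Unit = {
            while (!isStopped()) {
              pollOnce().foreach(store)   // store() hands records to Spark
              Thread.sleep(60 * 1000)     // e.g. poll every minute
            }
          }
        }.start()
      }

      def onStop(): Unit = {}             // the isStopped() check ends the loop

      private def pollOnce(): Seq[String] = Seq.empty  // placeholder JDBC query
    }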

Re: Spark got stuck with BlockManager after computing connected components using GraphX

2015-07-05 Thread Hellen
Sorry for the silly question. I'm fairly new to Spark. Because of the cleanup log messages, I didn't see the scala prompt, so I thought it was still working on something. When I pressed Enter, I got disconnected. I finally tried typing the variable name, which actually worked. -- View this message in context:

Re: JDBC Streams

2015-07-05 Thread ayan guha
Hi, Thanks for the reply. Here is my situation: I have a DB which enables synchronous CDC; think of this as a DB trigger which writes to a table with changed values as soon as something changes in the production table. My job will need to pick up the data as soon as it arrives, which can be every 1 min

Spark custom streaming receiver not storing data reliably?

2015-07-05 Thread Ajit Bhingarkar
Hi, I am trying to integrate the Drools rules API with Spark so that the solution could solve a few CEP-centric use cases. When I read data from a local file (simple FileWriter - readLine()), I see that all my rules are reliably fired and every time I get the results as expected. I have tested with

Re: All master are unreponsive issue

2015-07-05 Thread Aaron Davidson
Are you seeing this after the app has already been running for some time, or just at the beginning? Generally, registration should only occur once initially, and a timeout would be due to the master not being accessible. Try telnetting to the master IP/port from the machine on which the driver will

Re: Futures timed out after 10000 milliseconds

2015-07-05 Thread SamRoberts
Please note -- I am trying to run this with sbt run or spark-submit -- getting the same errors in both. Since I am in standalone mode, I assume I need not start the Spark master, am I right? I realize this is probably a basic setup issue, but am unable to get past it. Any help will be

Re: .NET on Apache Spark?

2015-07-05 Thread Ruslan Dautkhanov
Scala used to run on .NET http://www.scala-lang.org/old/node/10299 -- Ruslan Dautkhanov On Thu, Jul 2, 2015 at 1:26 PM, pedro ski.rodrig...@gmail.com wrote: You might try using .pipe() and installing your .NET program as a binary across the cluster (or using addFile). Its not ideal to pipe

Re: Why Kryo Serializer is slower than Java Serializer in TeraSort

2015-07-05 Thread Will Briggs
That code doesn't appear to be registering classes with Kryo, which means the fully-qualified classname is stored with every Kryo record. The Spark documentation has more on this: https://spark.apache.org/docs/latest/tuning.html#data-serialization Regards, Will On July 5, 2015, at 2:31 AM,
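A minimal sketch of the registration Will describes (MyRecord is a stand-in for whatever classes actually flow through the shuffle in the TeraSort job):

    import org.apache.spark.SparkConf

    case class MyRecord(key: Array[Byte], value: Array[Byte])  // illustrative

    val conf = new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .registerKryoClasses(Array(classOf[MyRecord], classOf[Array[MyRecord]]))

Without registration, Kryo writes the fully-qualified class name alongside every record, which can make the shuffle larger and slower than expected.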

RE: JDBC Streams

2015-07-05 Thread Ashic Mahtab
If it is indeed a reactive use case, then Spark Streaming would be a good choice. One approach worth considering - is it possible to receive a message via Kafka (or some other queue)? That'd not need any polling, and you could use standard consumers. If polling isn't an issue, then writing a

Re: configuring max sum of cores and memory in cluster through command line

2015-07-05 Thread Ruslan Dautkhanov
It's not possible to specify YARN RM parameters on the spark-submit command line. You have to specify all resources that are available on your cluster to YARN upfront. If you want to limit the amount of resources available to your Spark job, consider using YARN dynamic resource pools instead

Benchmark results between Flink and Spark

2015-07-05 Thread Slim Baltagi
Hi, Apache Flink outperforms Apache Spark in processing machine learning and graph algorithms and relational queries, but not in batch processing! The results were published in the proceedings of the 18th International Conference, Business Information Systems 2015, Poznań, Poland, June 24-26, 2015.

RE: .NET on Apache Spark?

2015-07-05 Thread Ashic Mahtab
Unfortunately, afaik that project is long dead. It'd be an interesting project to create an intermediary protocol, perhaps using something that nearly everything these days understands (unfortunately [!] that might be JavaScript). For example, instead of pickling language constructs, it might be

Re: How to stop making Multiple copies in memory when running multiple Spark jobs?

2015-07-05 Thread Haoyuan Li
You can also find more info here: http://tachyon-project.org/master/Running-Spark-on-Tachyon.html Hope this helps. Haoyuan On Tue, Jun 30, 2015 at 11:28 PM, Himanshu Mehra himanshumehra@gmail.com wrote: Hi neprasad, You should give the Tachyon system a try, or any other in-memory DB.
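A sketch of the Tachyon route on Spark 1.x (the master URL is illustrative; spark.tachyonStore.url must point at a running Tachyon master for OFF_HEAP storage to work):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.storage.StorageLevel

    val conf = new SparkConf()
      .setAppName("tachyon-offheap")
      .set("spark.tachyonStore.url", "tachyon://tachyon-master:19998")
    val sc = new SparkContext(conf)

    // OFF_HEAP keeps the cached blocks in Tachyon, outside the executor heaps,
    // so several Spark jobs can share a single copy of the data.
    val shared = sc.textFile("hdfs:///shared/input").persist(StorageLevel.OFF_HEAP)
    shared.count()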

Re: .NET on Apache Spark?

2015-07-05 Thread Jörn Franke
IronPython shares with Python only the syntax - at best. It is a scripting language within the .NET framework. Many applications have this for scripting the application itself. This won't work for you. You can use pipes or write your Spark jobs in Java/Scala/R and submit them via your .NET
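A minimal sketch of the pipe() route mentioned here and earlier in the thread (the binary path is hypothetical; the .NET program must read lines on stdin, write lines to stdout, and be present on every worker node):

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("dotnet-pipe"))
    val transformed = sc.textFile("hdfs:///input")
      .pipe("/opt/myapp/transform.exe")   // one external process per partition
    transformed.saveAsTextFile("hdfs:///output")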

Re: Futures timed out after 10000 milliseconds

2015-07-05 Thread Ted Yu
Sam: Regarding "where would one set this timeout?" -- with the following change, it would be relatively easier to see which conf to change: [SPARK-6980] [CORE] Akka timeout exceptions indicate which conf controls them (RPC Layer). FYI On Sun, Jul 5, 2015 at 1:46 PM, Sean Owen so...@cloudera.com wrote: Usually

cores and resource management

2015-07-05 Thread nizang
Hi, We're running Spark 1.4.0 on EC2, with 6 machines, 4 cores each. We're trying to run an application with a given total-executor-cores, but we want it to run on as few machines as possible (e.g. with total-executor-cores=4, we'll want a single machine; with total-executor-cores=12, we'll
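One standalone-mode knob that may fit this (an assumption; it is cluster-wide, not per-application): the master's spark.deploy.spreadOut setting, which defaults to true (spread executors across workers) and can be set to false to pack an application onto as few workers as possible:

    # in the standalone master's conf/spark-defaults.conf (affects all apps)
    spark.deploy.spreadOut  false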

Re: Benchmark results between Flink and Spark

2015-07-05 Thread Ted Yu
There was no mention of the versions of Flink and Spark used in the benchmarking. The size of the cluster is quite small. Cheers On Sun, Jul 5, 2015 at 10:24 AM, Slim Baltagi sbalt...@gmail.com wrote: Hi Apache Flink outperforms Apache Spark in processing machine learning graph algorithms and

Re: Futures timed out after 10000 milliseconds

2015-07-05 Thread Sean Owen
Usually this message means that the test was starting some process, like a Spark master, and it didn't ever start. The eventual error is a timeout. You have to try to dig into the test and logs to catch the real reason. On Sun, Jul 5, 2015 at 9:23 PM, SamRoberts samueli.robe...@yahoo.com wrote:

efficiently accessing partition data for datasets in S3 with SparkSQL

2015-07-05 Thread Steve Lindemann
I'm trying to use SparkSQL to efficiently query structured data from datasets in S3. The data is naturally partitioned by date, so I've laid it out in S3 as follows: s3://bucket/dataset/dt=2015-07-05/ s3://bucket/dataset/dt=2015-07-04/ s3://bucket/dataset/dt=2015-07-03/ etc. In each directory,
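Assuming the per-date directories hold Parquet files, Spark 1.4's partition discovery turns the dt=... directory names into a dt column, and a filter on that column should prune the other date prefixes rather than scanning them. A minimal sketch (sqlContext as provided by the shell):

    val df = sqlContext.read.parquet("s3://bucket/dataset")
    val oneDay = df.filter(df("dt") === "2015-07-05")  // reads only dt=2015-07-05/
    oneDay.count()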

Re: .NET on Apache Spark?

2015-07-05 Thread Silvio Fiorito
Joe Duffy, director of engineering on Microsoft's compiler team, made a comment about investigating F# type providers for Spark: https://twitter.com/xjoeduffyx/status/614076012372955136 From: Ashic Mahtab (as...@live.com) Sent: Sunday, July 5, 2015 1:29 PM To: Ruslan

Re: Futures timed out after 10000 milliseconds

2015-07-05 Thread SamRoberts
Also, it's not clear where the 10000 millisecond timeout is coming from. Can someone explain -- and if it's a legitimate timeout problem, where would one set this timeout? -- View this message in context:

Re: Spark-ImageAnalysis

2015-07-05 Thread Slim Baltagi
Hi Prakhar, How about checking the following web resources related to image processing with Spark? They are all listed in my Big Data Knowledge Base, http://www.SparkBigData.com: 1. Scaling Up Fast: Real-time Image Processing and Analytics using Spark - Kevin Mader (ETH Zurich) [VIDEO]

Re: 1.4.0 regression: out-of-memory errors on small data

2015-07-05 Thread Andy Huang
We have hit the same issue in the spark shell when registering a temp table. We observed it happening with those who had JDK 6. The problem went away after installing JDK 8. This was only for the tutorial materials, which were about loading a Parquet file. Regards Andy On Sat, Jul 4, 2015 at 2:54 AM,

Re: 1.4.0 regression: out-of-memory errors on small data

2015-07-05 Thread Denny Lee
I had run into the same problem, where everything was working swimmingly with Spark 1.3.1. When I switched to Spark 1.4, either upgrading to Java 8 (from Java 7) or bumping up the PermGen size solved my issue. HTH! On Mon, Jul 6, 2015 at 8:31 AM Andy Huang andy.hu...@servian.com.au

Re: Benchmark results between Flink and Spark

2015-07-05 Thread Jerry Lam
Hi guys, I just read the paper too. There is not much information regarding why Flink is faster than Spark for data-science-type workloads in the benchmark. It is very difficult to generalize the conclusion of a benchmark, from my point of view. How much experience the author has with Spark is

lower and upper offset not working in spark with mysql database

2015-07-05 Thread Hafiz Mujadid
Hi all! I am trying to read records from offset 100 to 110 from a table using the following piece of code: val sc = new SparkContext(new SparkConf().setAppName("SparkJdbcDs").setMaster("local[*]")) val sqlContext = new SQLContext(sc) val options = new HashMap[String, String]()

Re: 1.4.0 regression: out-of-memory errors on small data

2015-07-05 Thread Simeon Simeonov
The file is at https://www.dropbox.com/s/a00sd4x65448dl2/apache-spark-failure-data-part-0.gz?dl=1 The command was included in the gist: SPARK_REPL_OPTS="-XX:MaxPermSize=256m" spark-1.4.0-bin-hadoop2.6/bin/spark-shell --packages com.databricks:spark-csv_2.10:1.0.3 --driver-memory 4g

Aggregate to array (or 'slice by key') with DataFrames

2015-07-05 Thread Alex Beatson
Hello, I'm migrating some RDD-based code to using DataFrames. We've seen massive speedups so far! One of the operations in the old code creates an array of the values for each key, as follows: val collatedRDD = valuesRDD.mapValues(value => Array(value)).reduceByKey((array1, array2) =
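The snippet is cut off above; a plausible completion of that RDD pattern (the element types are assumed):

    val collatedRDD = valuesRDD
      .mapValues(value => Array(value))
      .reduceByKey((array1, array2) => array1 ++ array2)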

RE: lower and upper offset not working in spark with mysql database

2015-07-05 Thread Manohar753
I think you should mention partitionColumn like below, and the column type should be numeric. It works for my case. options.put("partitionColumn", "revision"); Thanks, Manohar From: Hafiz Mujadid [via Apache Spark User List] [mailto:ml-node+s1001560n23635...@n3.nabble.com] Sent: Monday,
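A sketch of the options that go together (URL and table name are illustrative; the bounds are the poster's 100-110 range). Note that lowerBound/upperBound only set the stride of the generated partitions; they do not filter rows, so selecting just rows 100-110 still needs a filter on the DataFrame or a WHERE clause in dbtable:

    val df = sqlContext.read.format("jdbc").options(Map(
      "url"             -> "jdbc:mysql://host:3306/db",
      "driver"          -> "com.mysql.jdbc.Driver",
      "dbtable"         -> "mytable",
      "partitionColumn" -> "id",     // must be a numeric column
      "lowerBound"      -> "100",
      "upperBound"      -> "110",
      "numPartitions"   -> "2"
    )).load()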

Re: lower and upper offset not working in spark with mysql database

2015-07-05 Thread Hafiz Mujadid
thanks On Mon, Jul 6, 2015 at 10:46 AM, Manohar753 [via Apache Spark User List] ml-node+s1001560n23637...@n3.nabble.com wrote: I think you should mention partitionColumn like below, and the column type should be numeric. It works for my case. options.put("partitionColumn", "revision");

DESCRIBE FORMATTED doesn't work in Hive Thrift Server?

2015-07-05 Thread Rex Xiong
Hi, I try to use DESCRIBE FORMATTED for one table created in Spark, but it seems the results are all empty. I want to get metadata for the table; what are the other options? Thanks +---+ | result | +---+ | # col_name | |
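One way to get at the metadata from Spark itself when the Thrift server's DESCRIBE FORMATTED comes back empty (the table name is illustrative):

    val df = sqlContext.table("my_table")
    df.printSchema()             // column names, types, nullability
    df.dtypes.foreach(println)   // (name, type) pairs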

RE: Benchmark results between Flink and Spark

2015-07-05 Thread nate
Maybe Flink benefits from some of the points they outline here: http://flink.apache.org/news/2015/05/11/Juggling-with-Bits-and-Bytes.html Probably re-running the benchmarks with the 1.5/Tungsten line would close the gap a bit (or a lot), with Spark moving towards a similar style of off-heap memory mgmt,

Re: 1.4.0 regression: out-of-memory errors on small data

2015-07-05 Thread Simeon Simeonov
Yin, With 512MB PermGen, the process still hung and had to be kill -9'd. At 1GB, the spark-shell-associated processes stopped hanging and started exiting: scala> println(dfCount.first.getLong(0)) 15/07/06 00:10:07 INFO storage.MemoryStore: ensureFreeSpace(235040) called with curMem=0,

Can we allow executor to exit when tasks fail too many time?

2015-07-05 Thread Tao Li
I have a long-lived Spark application running on YARN. On some nodes, it tries to write to the shuffle path in the shuffle map task, but the root path /search/hadoop10/yarn_local/usercache/spark/ was deleted, so the task fails. So every time a shuffle map task runs on this node, it was

Re: Futures timed out after 10000 milliseconds

2015-07-05 Thread SamRoberts
Thanks.. tried local[*] -- it didn't help. I agree that it is something to do with the SparkContext.. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Futures-timed-out-after-1-milliseconds-tp23622p23633.html Sent from the Apache Spark User List mailing

Re: Can we allow executor to exit when tasks fail too many time?

2015-07-05 Thread Tao Li
Nodes cloud10141049104.wd.nm.nop.sogou-op.org and cloud101417770.wd.nm.ss.nop.sogou-op.org failed too many times; I want to know if they can be taken offline automatically when they fail too many times? 2015-07-06 12:25 GMT+08:00 Tao Li litao.bupt...@gmail.com: I have a long live spark application running on
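For reference, a hedged sketch: as far as I know, Spark 1.4 has no setting that retires an executor or node after repeated task failures; spark.task.maxFailures (default 4) only bounds how often a single task may fail before the whole job aborts:

    import org.apache.spark.SparkConf

    val conf = new SparkConf().set("spark.task.maxFailures", "8")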

Re: Futures timed out after 10000 milliseconds

2015-07-05 Thread SamRoberts
One more data point -- sbt seems to have a bigger problem with this than spark-submit. With spark-submit, I am able to get it to run several times, while sbt fails most of the time (or more recently all the time). -- View this message in context:

Re: Spark custom streaming receiver not storing data reliably?

2015-07-05 Thread Jörn Franke
Can you provide the result set you are using and specify how you integrated the Drools engine? Drools is basically based on a large shared memory. Hence, if you have several tasks in Spark, they end up using different shared memory areas. A full integration of Drools requires some sophisticated

Re: 1.4.0 regression: out-of-memory errors on small data

2015-07-05 Thread Yin Huai
I have never seen an issue like this. Setting the PermGen size to 256m should solve the problem. Can you send me your test file and the command used to launch the spark shell or your application? Thanks, Yin On Sun, Jul 5, 2015 at 9:17 PM, Simeon Simeonov s...@swoop.com wrote: Yin, With 512Mb

java.io.IOException: No space left on device--regd.

2015-07-05 Thread Devarajan Srinivasan
Hi, I am trying to run an ETL on Spark which involves an expensive shuffle operation. Basically I require a self-join to be performed on a Spark DataFrame RDD. The job runs fine for around 15 hours, and when the stage (which performs the self-join) is about to complete, I get a java.io.IOException:
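When shuffle spill exhausts the default temp disk, pointing spark.local.dir at one or more larger volumes is the usual first step (paths are illustrative; on YARN the NodeManager's local dirs take precedence over this setting):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.local.dir", "/mnt1/spark-tmp,/mnt2/spark-tmp")  // comma-separated list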

Re: 1.4.0 regression: out-of-memory errors on small data

2015-07-05 Thread Yin Huai
Sim, Can you increase the PermGen size? Please let me know what your setting is when the problem disappears. Thanks, Yin On Sun, Jul 5, 2015 at 5:59 PM, Denny Lee denny.g@gmail.com wrote: I had run into the same problem where everything was working swimmingly with Spark 1.3.1. When I