Re: Elastic allocation(spark.dynamicAllocation.enabled) results in task never being executed.

2015-01-04 Thread Tsuyoshi Ozawa
Please check the document added by Andrew. I could run tasks with Spark 1.2.0. * https://github.com/apache/spark/pull/3731/files#diff-c3cbe4cabe90562520f22d2306aa9116R86 * https://github.com/apache/spark/pull/3757/files#diff-c3cbe4cabe90562520f22d2306aa9116R101 Thanks, - Tsuyoshi On Sun, Jan

RE: Better way of measuring custom application metrics

2015-01-04 Thread Shao, Saisai
Now I understand your requirement. Maybe there are some limitations in the current MetricsSystem; I think we can improve it. Thanks Jerry From: Enno Shioji [mailto:eshi...@gmail.com] Sent: Sunday, January 4, 2015 5:46 PM To: Shao, Saisai Cc: user@spark.apache.org Subject: Re: Better way of

RE: does calling cache()/persist() on a RDD trigger its immediate evaluation?

2015-01-04 Thread Kapil Malik
Hi Pengcheng YIN, RDD cache / persist calls do not trigger evaluation. The unpersist call is blocking (it does have an async flavor, but I am not sure what the SLAs on its behavior are). val rdd = sc.textFile().map() rdd.persist() // This does not trigger actual storage while(true){ val count =
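
A minimal sketch of that laziness (the input path and transform are illustrative assumptions):

    // assuming an existing SparkContext `sc` (e.g. the spark-shell)
    val rdd = sc.textFile("/path/to/input").map(_.length)  // illustrative path and transform
    rdd.persist()             // only marks the RDD for caching; nothing is computed yet
    val first  = rdd.count()  // first action: computes the lineage and fills the cache
    val second = rdd.count()  // later actions are served from the cached blocks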

Re: building spark1.2 meet error

2015-01-04 Thread xhudik
Hi J_soft, mvn does not produce tar packages by default. You get many jar files - each project has its own jar (e.g. mllib has mllib/target/spark-mllib_2.10-1.2.0.jar). However, if you want one big package with all dependencies, look here: https://github.com/apache/spark/tree/master/assembly

Performance degradation

2015-01-04 Thread Abhideep Chakravarty
Hi, We have recently upgraded to the latest version of Spark, and suddenly our Spark SQL queries are performing badly, i.e. response time has gone from sub-second to 5-6 seconds. What information do I need to provide in order to find out the reason for this performance degradation? Regards, Abhideep

Problem with building spark-1.2.0

2015-01-04 Thread Kartheek.R
Hi, I get the following error when I build spark-1.2.0 using sbt: [error] Nonzero exit code (128): git clone https://github.com/ScrapCodes/sbt-pom-reader.git /home/karthik/.sbt/0.13/staging/ad8e8574a5bcb2d22d23/sbt-pom-reader [error] Use 'last' for the full log. Any help please? Thanks --

A spark newbie question

2015-01-04 Thread Dinesh Vallabhdas
A Spark/Cassandra newbie question. Thanks in advance for the help. I have a Cassandra table with 2 columns, message_timestamp (timestamp) and message_type (text). The data is of the form: 2014-06-25 12:01:39 START 2014-06-25 12:02:39 START 2014-06-25 12:02:39 PAUSE 2014-06-25 14:02:39 STOP 2014-06-25

A spark newbie question on summary statistics

2015-01-04 Thread anondin
A Spark/Cassandra newbie question. Appreciate the help. I have a Cassandra table with 2 columns, message_timestamp (timestamp) and message_type (text). The data is of the form: 2014-06-25 12:01:39 START 2014-06-25 12:02:39 START 2014-06-25 12:02:39 PAUSE 2014-06-25 14:02:39 STOP

Re: Problem with building spark-1.2.0

2015-01-04 Thread Kartheek.R
The problem is that my network cannot access github.com for cloning some dependencies, as GitHub is blocked in India. What other possible ways are there around this problem? Thank you! On Sun, Jan 4, 2015 at 9:45 PM, Rapelly Kartheek kartheek.m...@gmail.com wrote: Hi, I get the following

Re: Problem with building spark-1.2.0

2015-01-04 Thread Ted Yu
Have you used Google to find some way of accessing github :-) On Jan 4, 2015, at 8:46 AM, Kartheek.R kartheek.m...@gmail.com wrote: The problem is that my network is not able to access github.com for cloning some dependencies as github is blocked in India. What are the other possible

Re: Problem with building spark-1.2.0

2015-01-04 Thread Rapelly Kartheek
Yeah... but none of the sites open. On Sun, Jan 4, 2015 at 10:35 PM, Ted Yu yuzhih...@gmail.com wrote: Have you used Google to find some way of accessing github :-) On Jan 4, 2015, at 8:46 AM, Kartheek.R kartheek.m...@gmail.com wrote: The problem is that my network is not able to

Re: Problem with building spark-1.2.0

2015-01-04 Thread xhudik
The error you provided only says that the build was unsuccessful. If you post what you did (what command you used) and the whole error trace, someone might be able to help you... -- View this message in context:

Re: Better way of measuring custom application metrics

2015-01-04 Thread Enno Shioji
Hi Jerry, thanks for your answer. I had looked at MetricsSystem, but I couldn't see how I could use it in my use case, which is: stream.map { i => Metriker.mr.meter(Metriker.metricName("testmetric123")).mark(i); i * 2 } From what I can see, a Source
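
A self-contained sketch of that pattern, assuming the Codahale metrics library; the Metriker helper is hypothetical (modeled on the post), and a plain RDD stands in for the DStream:

    import com.codahale.metrics.MetricRegistry

    // Hypothetical helper modeled on the post; each executor JVM initializes
    // the object (and thus its registry) independently.
    object Metriker {
      val mr = new MetricRegistry()
      def metricName(base: String): String = base  // could e.g. append host/app id
    }

    // assuming an existing SparkContext `sc`
    val nums = sc.parallelize(1L to 1000L)  // stand-in for the stream in the post
    val doubled = nums.map { i =>
      Metriker.mr.meter(Metriker.metricName("testmetric123")).mark(i)
      i * 2
    }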

Repartition Memory Leak

2015-01-04 Thread Brad Willard
I have a 10-node cluster with 600 GB of RAM. I'm loading a fairly large dataset from JSON files. When I load the dataset it is about 200 GB, yet it only creates 60 partitions. I'm trying to repartition to 256 to increase CPU utilization, but when I do, memory balloons to way over 2x

Re: A spark newbie question

2015-01-04 Thread Aniket Bhatnagar
Go through the Spark API documentation. Basically you have to do a group by on (date, message_type) and then a count. On Sun, Jan 4, 2015, 9:58 PM Dinesh Vallabhdas dines...@yahoo.com.invalid wrote: A spark cassandra newbie question. Thanks in advance for the help. I have a cassandra table with 2

Re: A spark newbie question

2015-01-04 Thread Sanjay Subramanian
val sconf = new SparkConf().setMaster("local").setAppName("MedicalSideFx-CassandraLogsMessageTypeCount") val sc = new SparkContext(sconf) val inputDir = "/path/to/cassandralogs.txt" sc.textFile(inputDir).map(line => line.replace("\"", "")).map(line => (line.split(' ')(0) + " " + line.split(' ')(2),
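
A cleaned-up, self-contained version of that approach (a sketch; the input path and field positions are assumptions based on the sample data in the question):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.SparkContext._  // pair-RDD implicits (needed when compiling against Spark 1.2)

    val sconf = new SparkConf().setMaster("local").setAppName("MessageTypeCount")
    val sc = new SparkContext(sconf)
    val counts = sc.textFile("/path/to/cassandralogs.txt")
      .map(_.replace("\"", ""))
      .map { line =>
        val f = line.split(' ')
        ((f(0), f(2)), 1)  // key = (date, message_type), e.g. ("2014-06-25", "START")
      }
      .reduceByKey(_ + _)
    counts.collect().foreach { case ((date, mtype), n) => println(s"$date $mtype $n") }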

Re: Shuffle Problems in 1.2.0

2015-01-04 Thread Josh Rosen
It doesn't seem like there are a whole lot of clues to go on here without seeing the job code. The original "org.apache.spark.SparkException: PairwiseRDD: unexpected value: List([B@130dc7ad)" error suggests that maybe there's an issue with PySpark's serialization / tracking of types, but it's

Re: spark.akka.frameSize limit error

2015-01-04 Thread Josh Rosen
Ah, so I guess this *is* still an issue since we needed to use a bitmap for tracking zero-sized blocks (see https://issues.apache.org/jira/browse/SPARK-3740; this isn't just a performance issue; it's necessary for correctness). This will require a bit more effort to fix, since we'll either have

Reading one partition at a time

2015-01-04 Thread Michael Albert
Greetings! I would like to know whether the code below will read one-partition-at-a-time, and whether I am reinventing the wheel. If I may explain: upstream code has managed (I hope) to save an RDD such that each partition file (e.g., part-r-0, part-r-1) contains exactly the data subset
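
One common way to compute a single partition per job (a minimal sketch, not necessarily what the original code does) is to filter on the partition index; tasks are still scheduled for every partition, but the non-matching ones return without consuming their input:

    import org.apache.spark.rdd.RDD
    import scala.reflect.ClassTag

    // Collect only partition i; the other partitions' iterators are never consumed.
    def collectOnePartition[T: ClassTag](rdd: RDD[T], i: Int): Array[T] =
      rdd.mapPartitionsWithIndex { (idx, iter) =>
        if (idx == i) iter else Iterator.empty
      }.collect()

Looping i from 0 until rdd.partitions.size then pulls the data down one partition at a time.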

a vague question, but perhaps it might ring a bell

2015-01-04 Thread Michael Albert
Greetings! So, I think I have data saved so that each partition (part-r-0, etc.) is exactly what I want to translate into an output file of a format not related to Hadoop. I believe I've figured out how to tell Spark to read the data set without re-partitioning (in another post I mentioned

Does SparkSQL not support nested IF(1=1, 1, IF(2=2, 2, 3)) statements?

2015-01-04 Thread RK
When I use a single IF statement like select IF(col1 != "", col1+'$'+col3, col2+'$'+col3) from my_table, it works fine. However, when I use a nested IF like select IF(col1 != "", col1+'$'+col3, IF(col2 != "", col2+'$'+col3, '$')) from my_table, I am getting the following exception. Exception in

Launching Spark app in client mode for standalone cluster

2015-01-04 Thread Boromir Widas
Hello, I am trying to launch a Spark app (client mode for a standalone cluster) from a Spray server, using the following code. When I run it as $ java -cp class paths SprayServer, the SimpleApp.getA() call from SprayService returns -1 (which means it sees the logData RDD as null for HTTP

Re: Does SparkSQL not support nested IF(1=1, 1, IF(2=2, 2, 3)) statements?

2015-01-04 Thread RK
BTW, I am seeing this issue in Spark 1.1.1. On Sunday, January 4, 2015 7:29 PM, RK prk...@yahoo.com.INVALID wrote: When I use a single IF statement like select IF(col1 != "", col1+'$'+col3, col2+'$'+col3) from my_table, it works fine. However, when I use a nested IF like select IF(col1

Driver hangs on running mllib word2vec

2015-01-04 Thread Eric Zhen
Hi, When we run mllib word2vec (spark-1.1.0), the driver gets stuck with 100% CPU usage. Here is the jstack output: main prio=10 tid=0x40112800 nid=0x46f2 runnable [0x4162e000] java.lang.Thread.State: RUNNABLE at

Re: Shuffle write increases in spark 1.2

2015-01-04 Thread 정재부
Sure, here is a ticket. https://issues.apache.org/jira/browse/SPARK-5081 --- Original Message --- Sender: Josh Rosen rosenvi...@gmail.com Date: 2015-01-05 06:14 (GMT+09:00) Title: Re: Shuffle write increases in spark 1.2 If you have a small reproduction for this issue,

Re: Does SparkSQL not support nested IF(1=1, 1, IF(2=2, 2, 3)) statements?

2015-01-04 Thread RK
The issue is happening when I try to concatenate column values in the query, like col1+'$'+col3. For some reason, the issue does not manifest itself when I do a single IF query. Is there a concat function in SparkSQL? I can't find anything in the documentation. Thanks, RK On Sunday,
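
For what it's worth, Hive's concat UDF should be usable when the query goes through a HiveContext rather than the plain SQLContext (a sketch; the table and column names are taken from the post):

    import org.apache.spark.sql.hive.HiveContext

    // assuming an existing SparkContext `sc`
    val hiveCtx = new HiveContext(sc)
    val result = hiveCtx.sql(
      "SELECT IF(col1 != '', concat(col1, '$', col3), " +
      "IF(col2 != '', concat(col2, '$', col3), '$')) FROM my_table")
    result.collect().foreach(println)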

Re: Spark for core business-logic? - Replacing: MongoDB?

2015-01-04 Thread Jörn Franke
Hello, It really depends on your requirements: what kind of machine learning algorithms, your budget, whether you are building something completely new or integrating with an existing application, etc. You can also run MongoDB as a cluster. I don't think this question can be answered generally, but

Controlling number of executors on Mesos vs YARN

2015-01-04 Thread mvle
I'm trying to compare the performance of Spark running on Mesos vs YARN. However, I am having problems configuring the Spark workload to run in a similar way on both. When running Spark on YARN, you can specify the number of executors per node. So if I have a node with 4
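
For reference, the knobs differ between the two managers (a sketch; property names as of Spark 1.x, and the Mesos side assumes coarse-grained mode):

    import org.apache.spark.SparkConf

    // YARN: executor count and size are explicit.
    val yarnConf = new SparkConf()
      .set("spark.executor.instances", "8")  // or --num-executors on spark-submit
      .set("spark.executor.cores", "4")
      .set("spark.executor.memory", "8g")

    // Mesos (coarse-grained): no per-node executor count; you cap total cores instead.
    val mesosConf = new SparkConf()
      .set("spark.mesos.coarse", "true")
      .set("spark.cores.max", "32")          // total cores across the whole cluster
      .set("spark.executor.memory", "8g")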

Re: Issue with Parquet on Spark 1.2 and Amazon EMR

2015-01-04 Thread Adam Gilmore
Just an update on this - I found that the script by Amazon was the culprit - not exactly sure why. When I installed Spark manually onto the EMR (and did the manual configuration of all the EMR stuff), it worked fine. On Mon, Dec 22, 2014 at 11:37 AM, Adam Gilmore dragoncu...@gmail.com wrote:

Re: Launching Spark app in client mode for standalone cluster

2015-01-04 Thread Simon Chan
Boromir, You may like to take a look at how we make Spray and Spark work together in the PredictionIO project: https://github.com/PredictionIO/PredictionIO Simon On Sun, Jan 4, 2015 at 8:31 PM, Chester At Work ches...@alpinenow.com wrote: Just a guess here, may not be correct. Spray

Re: Parquet schema changes

2015-01-04 Thread Adam Gilmore
I saw that in the source, which is why I was wondering. I was mainly reading: http://blog.cloudera.com/blog/2013/10/parquet-at-salesforce-com/ "A query that tries to parse the organizationId and userId from the 2 logTypes should be able to do so correctly, though they are positioned differently

Re: a vague question, but perhaps it might ring a bell

2015-01-04 Thread Akhil Das
What are you trying to do? Can you paste the whole code? I used to see this sort of exception when I closed the fs object inside map/mapPartitions etc. Thanks Best Regards On Mon, Jan 5, 2015 at 6:43 AM, Michael Albert m_albert...@yahoo.com.invalid wrote: Greetings! So, I think I have data

Re: Repartition Memory Leak

2015-01-04 Thread Josh Rosen
@Brad, I'm guessing that the additional memory usage is coming from the shuffle performed by coalesce, so that at least explains the memory blowup. On Sun, Jan 4, 2015 at 10:16 PM, Akhil Das ak...@sigmoidanalytics.com wrote: You can try: - Using KryoSerializer - Enabling RDD Compression -
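
For context, repartition(n) is just coalesce(n, shuffle = true), which is why growing the partition count always pays for a full shuffle (a minimal sketch, assuming an existing SparkContext `sc`):

    val rdd = sc.parallelize(1 to 1000000, 60)

    val grown  = rdd.repartition(256)  // == coalesce(256, shuffle = true): full shuffle
    val shrunk = rdd.coalesce(30)      // shuffle = false by default: merges partitions locally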

Re: Issue with Parquet on Spark 1.2 and Amazon EMR

2015-01-04 Thread Aniket Bhatnagar
Can you confirm your EMR version? Could it be because of the classpath entries for EMRFS? You might face issues using S3 without them. Thanks, Aniket On Mon, Jan 5, 2015, 11:16 AM Adam Gilmore dragoncu...@gmail.com wrote: Just an update on this - I found that the script by Amazon was the

spark worker nodes getting disassociated while running hive on spark

2015-01-04 Thread Somnath Pandeya
Hi, I have set up a Spark 1.2 standalone cluster and am trying to run Hive on Spark by following the link below. https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started I got the latest build of Hive on Spark from git and was trying to run a few queries. Queries are

python API for gradient boosting?

2015-01-04 Thread Christopher Thom
Hi, I wonder if anyone knows when a Python API will be added for Gradient Boosted Trees? I see that Java and Scala APIs were added for the 1.2 release, and I would love to be able to build GBMs in PySpark too. Cheers, Chris Christopher Thom QUANTIUM Level 25, 8 Chifley, 8-12 Chifley Square

Re: Repartition Memory Leak

2015-01-04 Thread Akhil Das
You can try: - Using KryoSerializer - Enabling RDD Compression - Setting storage type to MEMORY_ONLY_SER or MEMORY_AND_DISK_SER Thanks Best Regards On Sun, Jan 4, 2015 at 11:53 PM, Brad Willard bradwill...@gmail.com wrote: I have a 10 node cluster with 600gb of ram. I'm loading a fairly
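
Those three suggestions expressed as configuration (a minimal sketch; the values are illustrative):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.storage.StorageLevel

    val conf = new SparkConf()
      .setAppName("serialized-cache")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.rdd.compress", "true")  // compress serialized RDD blocks
    val sc = new SparkContext(conf)

    // Store partitions as serialized bytes, spilling to disk when memory is tight:
    val rdd = sc.parallelize(1 to 1000000)
    rdd.persist(StorageLevel.MEMORY_AND_DISK_SER)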