Re: Spark runs into an Infinite loop even if the tasks are completed successfully

2015-08-13 Thread Akhil Das
I figured that out, and these are my findings:
- It just enters an infinite loop when there's a duplicate partition id.
- It enters an infinite loop when the partition id starts from 1 rather than 0.
Something like this piece of code can reproduce it (in getPartitions()): val
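A minimal Scala sketch (an assumption of what the custom RDD might have looked like) showing how an off-by-one partition index produces the hang:

    import org.apache.spark.{Partition, SparkContext, TaskContext}
    import org.apache.spark.rdd.RDD

    // Hypothetical partition type whose index is taken verbatim from the caller.
    class MyPartition(val idx: Int) extends Partition {
      override def index: Int = idx
    }

    class BrokenRDD(sc: SparkContext) extends RDD[Int](sc, Nil) {
      // BUG: indices start at 1, so no partition reports index 0 and the
      // scheduler keeps waiting for it even though every task succeeded.
      override def getPartitions: Array[Partition] =
        (1 to 4).map(i => new MyPartition(i): Partition).toArray

      override def compute(split: Partition, ctx: TaskContext): Iterator[Int] =
        Iterator(split.index)
    }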

Re: Error writing to cassandra table using spark application

2015-08-13 Thread Akhil Das
Can you look in the worker logs and see what's going wrong? Thanks Best Regards On Wed, Aug 12, 2015 at 9:53 PM, Nupur Kumar (BLOOMBERG/ 731 LEX) nkumar...@bloomberg.net wrote: Hello, I am doing this for the first time, so feel free to let me know/forward this to where it needs to be if not

Spark Streaming failing on YARN Cluster

2015-08-13 Thread Ramkumar V
Hi, I have a cluster of 1 master and 2 slaves. I'm running Spark Streaming on the master and I want to utilize all the nodes in my cluster. I had specified some parameters like driver memory and executor memory in my code. When I give --deploy-mode cluster --master yarn-cluster in my spark-submit, it

Re: collect() works, take() returns ImportError: No module named iter

2015-08-13 Thread YaoPau
In case anyone runs into this issue in the future, we got it working: the following variable must be set on the edge node: export PYSPARK_PYTHON=/your/path/to/whatever/python/you/want/to/run/bin/python I didn't realize that variable gets passed to every worker node. All I saw when searching for

UnknownHostNameException looking up host name with 64 characters

2015-08-13 Thread Jeff Jones
I've got a Spark application running on a host with 64 character FQDN. When running with Spark master local[*] I get the following error. Note, the host name should be ip-10-248-0-177.us-west-2.compute.internaldna.corp.adaptivebiotech.com but the last 6 characters are missing. The same

Re: ClassNotFound spark streaming

2015-08-13 Thread Akhil Das
You need to add that jar to the classpath. While submitting the job, you can use the --jars, --driver-class-path etc. configurations to add the jar. Apart from that, if you are running the job as a standalone application, then you can use the sc.addJar option to add the jar (which will ship this jar into
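As a concrete (hedged) example, with placeholder paths and class names:

    spark-submit \
      --class com.example.Main \
      --jars /path/to/dependency.jar \
      --driver-class-path /path/to/dependency.jar \
      my-app.jar

From a standalone application, sc.addJar("/path/to/dependency.jar") achieves the same once the SparkContext exists.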

Re: dse spark-submit multiple jars issue

2015-08-13 Thread Javier Domingo Cansino
Please notice the 'jars: null'. I don't know why you put ///, but I would propose you just use normal absolute paths: dse spark-submit --master spark://10.246.43.15:7077 --class HelloWorld --jars /home/missingmerch/postgresql-9.4-1201.jdbc41.jar /home/missingmerch/dse.jar

Spark Streaming: Change Kafka topics on runtime

2015-08-13 Thread Nisrina Luthfiyati
Hi all, I want to write a Spark Streaming program that listens to Kafka for a list of topics. The list of topics that I want to consume is stored in a DB and might change dynamically. I plan to periodically refresh this list of topics in the Spark Streaming app. My question is: is it possible to

serialization issue

2015-08-13 Thread 周千昊
Hi, I am using Spark 1.4 and an issue occurred. I am trying to use the aggregate function: JavaRDD<String> rdd = some rdd; HashMap<Long, TypeA> zeroValue = new HashMap<>(); // add initial key-value pair for zeroValue rdd.aggregate(zeroValue, new
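A hedged Scala sketch of the aggregate call being described, with TypeA reduced to String for illustration and rdd assumed to be an RDD[String]; both closures must be serializable, which is the usual source of this kind of issue:

    import scala.collection.mutable

    val zero = mutable.HashMap.empty[Long, String]  // the zeroValue
    val merged = rdd.aggregate(zero)(
      (acc, line) => { acc(line.length.toLong) = line; acc }, // seqOp, per partition
      (a, b)      => { a ++= b; a }                           // combOp, across partitions
    )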

Re: Does Spark optimization might miss to run transformation?

2015-08-13 Thread Michael Armbrust
-dev If you want to guarantee the side effects happen, you should use foreach or foreachPartitions. A `take`, for example, might only evaluate a subset of the partitions until it finds enough results. On Wed, Aug 12, 2015 at 7:06 AM, Eugene Morozov fathers...@list.ru wrote: Hi! I'd like to
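A small sketch of the difference; sc is an existing SparkContext, and the println stands in for any side effect:

    val logged = sc.parallelize(1 to 100, 10).map { x =>
      println(s"side effect for $x")  // the side effect we want to guarantee
      x
    }

    logged.take(5)           // may evaluate only the first partition(s)
    logged.foreach(_ => ())  // runs every partition, so every side effect fires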

Re: make-distribution.sh failing at spark/R/lib/sparkr.zip

2015-08-13 Thread MEETHU MATHEW
Hi, It worked after removing that line. Thank you for the response and fix. Thanks & Regards, Meethu M On Thursday, 13 August 2015 4:12 AM, Burak Yavuz brk...@gmail.com wrote: For the record: https://github.com/apache/spark/pull/8147 https://issues.apache.org/jira/browse/SPARK-9916

Re: Spark Streaming failing on YARN Cluster

2015-08-13 Thread Ramkumar V
Yes, this file is available in this path on the same machine where I'm running Spark. Later I moved the spark-1.4.1 folder to all other machines in my cluster, but I'm still facing the same issue. *Thanks*, https://in.linkedin.com/in/ramkumarcs31 On Thu, Aug 13, 2015 at 1:17 PM, Akhil Das

Re: Controlling number of executors on Mesos vs YARN

2015-08-13 Thread Ajay Singal
Hi Tim, An option like spark.mesos.executor.max to cap the number of executors per node/application would be very useful. However, having an option like spark.mesos.executor.num to specify the desired number of executors per node would provide even better control. Thanks, Ajay On Wed, Aug

Re: What is the Effect of Serialization within Stages?

2015-08-13 Thread Hemant Bhanawat
A chain of map and flatmap does not cause any serialization-deserialization. On Wed, Aug 12, 2015 at 4:02 PM, Mark Heimann mark.heim...@kard.info wrote: Hello everyone, I am wondering what the effect of serialization is within a stage. My understanding of Spark as an execution engine is

Re: serialization issue

2015-08-13 Thread Anish Haldiya
While submitting the job, you can use the --jars, --driver-class-path etc. configurations to add the jar. Apart from that, if you are running the job as a standalone application, then you can use the sc.addJar option to add the jar (which will ship this jar into all the executors). Regards, Anish On

Re: Unit Testing

2015-08-13 Thread jay vyas
Yes, there certainly is, so long as Eclipse has the right plugins and so on to run Scala programs. You're really asking two questions: (1) Can I use a modern IDE to develop Spark apps, and (2) can we easily unit test Spark streaming apps. The answer is yes to both... Regarding your IDE: I like

Re: DataFrame column structure change

2015-08-13 Thread Eugene Morozov
I have a pretty complex nested structure with several levels. So in order to create it I use the SQLContext.createDataFrame method and provide specific Rows with specific StructTypes, both of which I build myself. To build a Row I iterate over my values and literally build a Row. List<Object>

Re: Spark - Standalone Vs YARN Vs Mesos

2015-08-13 Thread ๏̯͡๏
I am looking to decide what is best for my production-grade Spark application(s). YARN:
1. YARN supports security. When Spark is run over YARN, the communication between processes can use secure authentication through Kerberos.
2. A Spark standalone cluster can only run Spark jobs and

Streaming on Exponential Data

2015-08-13 Thread UMESH CHAUDHARY
Hi, I was working with the non-reliable receiver version of Spark-Kafka streaming, i.e. KafkaUtils.createStream..., where for testing purposes I was getting data at a constant rate from Kafka, and it was acting as expected. But when the data in Kafka grew exponentially, my program started crashing saying

Re: Unit Testing

2015-08-13 Thread Burak Yavuz
I would recommend this spark package for your unit testing needs ( http://spark-packages.org/package/holdenk/spark-testing-base). Best, Burak On Thu, Aug 13, 2015 at 5:51 AM, jay vyas jayunit100.apa...@gmail.com wrote: yes there certainly is, so long as eclipse has the right plugins and so on
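A minimal sketch of what such a test can look like, assuming ScalaTest plus the SharedSparkContext trait from that package:

    import com.holdenkarau.spark.testing.SharedSparkContext
    import org.scalatest.FunSuite

    class WordCountSuite extends FunSuite with SharedSparkContext {
      test("counts repeated words") {
        // sc is created (and torn down) for you by SharedSparkContext
        val counts = sc.parallelize(Seq("a", "b", "a")).countByValue()
        assert(counts("a") === 2)
      }
    }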

spark.streaming.maxRatePerPartition parameter: what are the benefits?

2015-08-13 Thread allonsy
Hello everyone, in the new Kafka Direct API, what are the benefits of setting a value for *spark.streaming.maxRatePerPartition*? In my case, I have 2-second batches consuming ~15k tuples from a topic split into 48 partitions (4 workers, 16 total cores). Is there any particular value I should

Re: Question regarding join with multiple columns with pyspark

2015-08-13 Thread Dan LaBar
The DataFrame issue has been fixed in Spark 1.5. Refer to SPARK-7990 https://issues.apache.org/jira/browse/SPARK-7990 and Stackoverflow: Spark specify multiple column conditions for dataframe join http://stackoverflow.com/a/31889190/1344789. On Tue, Apr 28, 2015 at 12:55 PM, Ali Bajwa

Re: Write to cassandra...each individual statement

2015-08-13 Thread Philip Weaver
All you'd need to do is *transform* the rdd before writing it, e.g. using the .map function. On Thu, Aug 13, 2015 at 11:30 AM, Priya Ch learnings.chitt...@gmail.com wrote: Hi All, I have a question in writing rdd to cassandra. Instead of writing entire rdd to cassandra, i want to write
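A hedged sketch, assuming the DataStax spark-cassandra-connector is on the classpath; Event and etl are hypothetical stand-ins for the real record type and transformation:

    import com.datastax.spark.connector._
    import org.apache.spark.rdd.RDD

    case class Event(id: Long, payload: String)                  // hypothetical schema
    def etl(e: Event): Event = e.copy(payload = e.payload.trim)  // hypothetical ETL step

    val events: RDD[Event] = ???  // the RDD you were about to write
    events.map(etl).saveToCassandra("my_keyspace", "events",
      SomeColumns("id", "payload"))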

Re: Spark DataFrames uses too many partitions

2015-08-13 Thread prosp4300
Hi, I want to know how you coalesce the partitions to one to improve the performance. Thanks On 2015-08-11 23:31, Al M wrote: I am using DataFrames with Spark 1.4.1. I really like DataFrames but the partitioning makes no sense to me. I am loading lots of very small files and joining them together.

Re: Spark Streaming: Change Kafka topics on runtime

2015-08-13 Thread Cody Koeninger
The current kafka stream implementation assumes the set of topics doesn't change during operation. You could either take a crack at writing a subclass that does what you need; stop/start; or if your batch duration isn't too small, you could run it as a series of RDDs (using the existing
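A rough sketch of the series-of-RDDs route on the Spark 1.x Kafka API; loadTopicsFromDb, partitionsOf, nextOffset and latestOffset are hypothetical helpers you would implement against your DB and Kafka, and sc/kafkaParams are assumed in scope:

    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.{KafkaUtils, OffsetRange}

    // Re-read the topic list each cycle, then pull one bounded batch per partition.
    val ranges: Array[OffsetRange] = loadTopicsFromDb().flatMap { topic =>
      partitionsOf(topic).map { p =>
        OffsetRange(topic, p, nextOffset(topic, p), latestOffset(topic, p))
      }
    }.toArray

    val batch = KafkaUtils.createRDD[String, String, StringDecoder, StringDecoder](
      sc, kafkaParams, ranges)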

Re: What is the Effect of Serialization within Stages?

2015-08-13 Thread Mark Heimann
Thanks a lot guys, that's exactly what I hoped for :-). Cheers, Mark 2015-08-13 6:35 GMT+02:00 Hemant Bhanawat hemant9...@gmail.com: A chain of map and flatmap does not cause any serialization-deserialization. On Wed, Aug 12, 2015 at 4:02 PM, Mark Heimann mark.heim...@kard.info wrote:

Re: What is the Effect of Serialization within Stages?

2015-08-13 Thread Zoltán Zvara
Serialization only occurs intra-stage when you are using Python, and as far as I know, only in the first stage, when reading the data and passing it to the Python interpreter the first time. Multiple operations are just chains of simple *map* and *flatMap* operators at task level on simple Scala

Re: About Databricks's spark-sql-perf

2015-08-13 Thread Yin Huai
Hi Todd, We have not got a chance to update it. We will update it after 1.5 release. Thanks, Yin On Thu, Aug 13, 2015 at 6:49 AM, Todd bit1...@163.com wrote: Hi, I got a question about the spark-sql-perf project by Databricks at https://github.com/databricks/spark-sql-perf/ The

Write to cassandra...each individual statement

2015-08-13 Thread Priya Ch
Hi All, I have a question about writing an RDD to Cassandra. Instead of writing the entire RDD to Cassandra, I want to write individual statements into Cassandra because there is a need to perform ETL on each message (which requires checking with the DB). How could I insert statements individually?

Re: Write to cassandra...each individual statement

2015-08-13 Thread Priya Ch
Hi Philip, I have the following requirement: I read streams of data from various partitions of a Kafka topic. Then I union the DStreams and apply a hash partitioner so messages with the same key go into a single partition of an RDD, which is of course handled by a single thread. This way we

RE: Create column in nested structure?

2015-08-13 Thread Ewan Leith
Never mind me, I've found an email to this list from Raghavendra Pandey which got me what I needed: val nestedCol = struct(df("nested2.column1"), df("nested2.column2"), df("flatcolumn")) val df2 = df.select(df("nested1"), nestedCol as "nested2") Thanks, Ewan From: Ewan Leith Sent: 13 August 2015 15:44

Re: saveToCassandra not working in Spark Job but works in Spark Shell

2015-08-13 Thread satish chandra j
Hi, Please let me know if I am missing anything in the below mail, to get the issue fixed. Regards, Satish Chandra On Wed, Aug 12, 2015 at 6:59 PM, satish chandra j jsatishchan...@gmail.com wrote: Hi, The below mentioned code works very well in the Spark Shell, but when the same is

Fwd: - Spark 1.4.1 - run-example SparkPi - Failure ...

2015-08-13 Thread Naga Vij
Hello, Any idea on why this is happening? Thanks Naga -- Forwarded message -- From: Naga Vij nvbuc...@gmail.com Date: Wed, Aug 12, 2015 at 5:47 PM Subject: - Spark 1.4.1 - run-example SparkPi - Failure ... To: user@spark.apache.org Hi, I am evaluating Spark 1.4.1 Any idea on

Spark 1.3.0: ExecutorLostFailure depending on input file size

2015-08-13 Thread Wyss Michael (wysm)
Hi, I've been at this problem for a few days now and wasn't able to solve it. I'm hoping I'm missing something that you'll spot! I'm trying to run a simple Python application on a 2-node cluster I set up in standalone mode: a master and a worker, where the master also takes on the role of a

Create column in nested structure?

2015-08-13 Thread Ewan Leith
Has anyone used withColumn (or another method) to add a column to an existing nested dataframe? If I call: df.withColumn("nested.newcolumn", df("oldcolumn")) then it just creates the new column with a '.' in its name, not under the nested structure. Thanks, Ewan

Spark RuntimeException hadoop output format

2015-08-13 Thread Mohit Anchlia
I have this call trying to save to HDFS 2.6: wordCounts.saveAsNewAPIHadoopFiles("prefix", "txt"); but I am getting the following: java.lang.RuntimeException: class scala.runtime.Nothing$ not org.apache.hadoop.mapreduce.OutputFormat
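This error usually means the output format class was inferred as Nothing. A hedged fix in Scala, passing the key, value, and format classes explicitly, assuming wordCounts is a pair DStream:

    import org.apache.hadoop.io.{IntWritable, Text}
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat

    // Explicit classes keep the compiler from inferring Nothing for the format.
    wordCounts.saveAsNewAPIHadoopFiles(
      "prefix", "txt",
      classOf[Text], classOf[IntWritable],
      classOf[TextOutputFormat[Text, IntWritable]])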

Re: Write to cassandra...each individual statement

2015-08-13 Thread Philip Weaver
So you need some state between messages in a partition. You can use mapPartitions or foreachPartition, which allow you to write code to process an entire partition. On Thu, Aug 13, 2015 at 11:48 AM, Priya Ch learnings.chitt...@gmail.com wrote: Hi Philip, I have the following requirement - I
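A sketch of that shape; openSession and toInsertStatement are hypothetical stand-ins for the DB connection and the per-message ETL/check:

    dstream.foreachRDD { rdd =>
      rdd.foreachPartition { messages =>
        val session = openSession()             // hypothetical: one connection per partition
        messages.foreach { m =>
          // per-message ETL / DB check, then the individual insert
          session.execute(toInsertStatement(m)) // hypothetical helper
        }
        session.close()
      }
    }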

Re: Spark - Standalone Vs YARN Vs Mesos

2015-08-13 Thread ๏̯͡๏
What are the ideas around a Spark cluster for streaming purposes? What is better: standalone / Mesos / YARN? Please share cluster details, size of data, and type of processing (multiple processing points, architecture or similar). I see folks using YARN clusters for streaming purposes. Regards,

Input size increasing every iteration of gradient boosted trees [1.4]

2015-08-13 Thread Matt Forbes
I am training a boosted trees model on a couple million input samples (with around 300 features) and am noticing that the input size of each stage is increasing each iteration. For each new tree, the first step seems to be building the decision tree metadata, which does a .count() on the input

Retrieving offsets from previous spark streaming checkpoint

2015-08-13 Thread Stephen Durfey
When deploying a Spark Streaming application I want to be able to retrieve the latest Kafka offsets that were processed by the pipeline, and create my Kafka direct streams from those offsets. Because the checkpoint directory isn't guaranteed to be compatible between job deployments, I don't want

Re: Input size increasing every iteration of gradient boosted trees [1.4]

2015-08-13 Thread Sean Owen
Not that I have any answer at this point, but I was discussing this exact same problem with Johannes today. An input size of ~20K records was growing each iteration by ~15M records. I could not see why on a first look. @jkbradley I know it's not much info but does that ring any bells? I think

Re: spark.streaming.maxRatePerPartition parameter: what are the benefits?

2015-08-13 Thread Cody Koeninger
It's just to limit the maximum number of records a given executor needs to deal with in a given batch. Typical usage would be if you're starting a stream from the beginning of a kafka log, or after a long downtime, and don't want ALL of the messages in the first batch. On Thu, Aug 13, 2015 at
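Illustrative numbers only: the setting caps records per second per partition, so with 48 partitions and 2-second batches a value of 500 would bound each batch at 48 * 500 * 2 = 48,000 records:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.streaming.maxRatePerPartition", "500") // 500 records/sec per partition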

Re: Input size increasing every iteration of gradient boosted trees [1.4]

2015-08-13 Thread Matt Forbes
Is this an artifact of a recent change? Does this not show up in any of the tests or benchmarks? On Thu, Aug 13, 2015 at 2:33 PM, Sean Owen so...@cloudera.com wrote: Not that I have any answer at this point, but I was discussing this exact same problem with Johannes today. An input size of

Re: Re: About Databricks's spark-sql-perf

2015-08-13 Thread Todd
Thanks Michael for the answer. I will watch the project and hope the update comes soon :-) At 2015-08-14 02:13:32, Michael Armbrust mich...@databricks.com wrote: Hey sorry, I've been doing a bunch of refactoring on this project. Most of the data generation was a huge hack (it was

Re: Retrieving offsets from previous spark streaming checkpoint

2015-08-13 Thread Cody Koeninger
Access the offsets using HasOffsetRanges, save them in your datastore, provide them as the fromOffsets argument when starting the stream. See https://github.com/koeninger/kafka-exactly-once On Thu, Aug 13, 2015 at 3:53 PM, Stephen Durfey sjdur...@gmail.com wrote: When deploying a spark
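A condensed sketch of that pattern against the Spark 1.x direct-stream API; ssc and kafkaParams are assumed in scope, and saveOffsets/loadOffsets are hypothetical datastore helpers:

    import kafka.common.TopicAndPartition
    import kafka.message.MessageAndMetadata
    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils}

    // On (re)deploy, resume from the stored offsets instead of a checkpoint.
    val fromOffsets: Map[TopicAndPartition, Long] = loadOffsets()
    val stream = KafkaUtils.createDirectStream[
        String, String, StringDecoder, StringDecoder, (String, String)](
      ssc, kafkaParams, fromOffsets,
      (m: MessageAndMetadata[String, String]) => (m.key, m.message))

    // After each batch, record how far we got.
    stream.foreachRDD { rdd =>
      saveOffsets(rdd.asInstanceOf[HasOffsetRanges].offsetRanges)
    }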

Custom serialization and checkpointing

2015-08-13 Thread Tech Meme
Hi Guys, We need to do some state checkpointing (an RDD that's updated using updateStateByKey). We would like finer control over the serialization. Also, this would allow us to do schema evolution in the deserialization code when we need to modify the structure of the classes associated with the

Re: Spark runs into an Infinite loop even if the tasks are completed successfully

2015-08-13 Thread Imran Rashid
Oh I see, you are defining your own RDD Partition types, and you had a bug where partition.index did not line up with the partition's slot in rdd.getPartitions. Is that correct? On Thu, Aug 13, 2015 at 2:40 AM, Akhil Das ak...@sigmoidanalytics.com wrote: I figured that out, And these are my

New Spark User - GBM iterations and Spark benchmarks

2015-08-13 Thread Sereday, Scott
I am exploring utilizing Spark as my dataset is becoming more and more difficult to manage and analyze. I'd appreciate it if anyone could provide feedback on the following questions: * I am especially interested in training large datasets using machine learning algorithms.

Re: About Databricks's spark-sql-perf

2015-08-13 Thread Michael Armbrust
Hey sorry, I've been doing a bunch of refactoring on this project. Most of the data generation was a huge hack (it was done before we supported partitioning natively) and used some private APIs that don't exist anymore. As a result, while doing the regression tests for 1.5 I deleted a bunch of

MatrixFactorizationModel.save got StackOverflowError

2015-08-13 Thread Benyi Wang
I'm using spark-1.4.1 and compiled it against CDH 5.3.2. When I use ALS.trainImplicit to build a model, I got this error when rank=40 and iterations=30. It worked for (rank=10, iterations=10) and (rank=20, iterations=20). What was wrong with (rank=40, iterations=30)? 15/08/13 01:16:40 INFO

how do you convert directstream into data frames

2015-08-13 Thread Mohit Durgapal
Hi All, After creating a direct stream like below: val events = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder]( ssc, kafkaParams, topicsSet) I would like to convert the above stream into data frames, so that I could run Hive queries over it. Could anyone

Always two tasks slower than others, and then job fails

2015-08-13 Thread randylu
It is strange that there are always two tasks slower than the others, and the corresponding partitions' data is larger, no matter how many partitions there are.
Executor ID | Address | Task Time | Shuffle Read Size / Records
1 | slave129.vsvs.com:56691 | 16 s | 99.5 MB /

Re: how do you convert directstream into data frames

2015-08-13 Thread Mohit Durgapal
Any idea anyone? On Fri, Aug 14, 2015 at 10:11 AM, Mohit Durgapal durgapalmo...@gmail.com wrote: Hi All, After creating a direct stream like below: val events = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder]( ssc, kafkaParams, topicsSet) I would
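One common shape for this on the 1.4-era API, sketched with a hypothetical Event case class as the schema:

    import org.apache.spark.sql.SQLContext

    case class Event(key: String, value: String)  // hypothetical schema

    events.foreachRDD { rdd =>
      val sqlContext = new SQLContext(rdd.sparkContext)  // or reuse a singleton
      import sqlContext.implicits._
      val df = rdd.map { case (k, v) => Event(k, v) }.toDF()
      df.registerTempTable("events")
      sqlContext.sql("SELECT value, count(*) FROM events GROUP BY value").show()
    }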

Reduce number of partitions before saving to file. coalesce or repartition?

2015-08-13 Thread Alexander Pivovarov
Hi Everyone, Which one should work faster (coalesce or repartition) if I need to reduce the number of partitions from 5000 to 3 before saving the RDD with saveAsTextFile? Total data size is about 400MB on disk in text format. Thank you
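A sketch of both calls (output paths are placeholders). For a pure reduction like this, coalesce usually wins because it avoids the shuffle, at the cost of also running the pre-save stages with only 3 tasks:

    // No shuffle: the 5000 partitions are narrowed into 3 before the save.
    rdd.coalesce(3).saveAsTextFile("hdfs:///out/coalesced")

    // Full shuffle: repartition(3) is coalesce(3, shuffle = true); it pays to
    // shuffle the ~400MB but keeps upstream stages at full parallelism.
    rdd.repartition(3).saveAsTextFile("hdfs:///out/repartitioned")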

worker and executor memory

2015-08-13 Thread James Pirz
Hi, I am using Spark 1.4 on a cluster (stand-alone mode), across 3 machines, for a workload similar to TPCH (analytical queries with multiple/multi-way large joins and aggregations). Each machine has 12GB of Memory and 4 cores. My total data size is 150GB, stored in HDFS (stored as Hive tables),

Re: Spark RuntimeException hadoop output format

2015-08-13 Thread Mohit Anchlia
I was able to get this working by using an alternative method; however, I only see 0-byte files in Hadoop. I've verified that the output does exist in the logs, but it's missing from HDFS. On Thu, Aug 13, 2015 at 10:49 AM, Mohit Anchlia mohitanch...@gmail.com wrote: I have this call trying to

matrix inverse and multiplication

2015-08-13 Thread go canal
Hello, I am new to Spark. I am looking for a matrix inverse and multiplication solution. I did a quick search and found a couple of solutions, but my requirements are: large matrices (up to 2 million x 2 million), support for a complex double data type, and preferably Java. There is one post

Re: graphx class not found error

2015-08-13 Thread Ted Yu
The code and error didn't go through. Mind sending again? Which Spark release are you using? On Thu, Aug 13, 2015 at 6:17 PM, dizzy5112 dave.zee...@gmail.com wrote: the code below works perfectly on both cluster and local modes but when i try to create a graph in cluster mode (it works

Re: graphx class not found error

2015-08-13 Thread dizzy5112
Oh, forgot to note: I'm using the Scala REPL for this. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/graphx-class-not-found-error-tp24253p24254.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Materials for deep insight into Spark SQL

2015-08-13 Thread Todd
Hi, I would like to ask whether there are slides, blogs, or videos on how Spark SQL is implemented: the process, or the whole picture of what happens when Spark SQL executes code. Thanks!

Re: Materials for deep insight into Spark SQL

2015-08-13 Thread Ted Yu
You can look under Developer Track: https://spark-summit.org/2015/#day-1 http://www.slideshare.net/jeykottalam/spark-sqlamp-camp2014?related=1 (slightly old) Catalyst design: https://docs.google.com/a/databricks.com/document/d/1Hc_Ehtr0G8SQUg69cmViZsMi55_Kf3tISD9GPGU5M1Y/edit FYI On Thu, Aug

Driver staggering task launch times

2015-08-13 Thread Ara Vartanian
I'm observing an unusual situation where my step duration increases as I add further executors to my cluster. My algorithm is fully data-parallelizable into a map phase, followed by a reduce step at the end that amounts to matrix addition. So I've kicked off a cluster of, say, 100 executors with 4

graphx class not found error

2015-08-13 Thread dizzy5112
The code below works perfectly in both cluster and local modes, but when I try to create a graph in cluster mode (it works in local mode) I get the following error. Any help appreciated. -- View this message in context:

Re: ERROR Executor java.lang.NoClassDefFoundError

2015-08-13 Thread nsalian
If --jars doesn't work, try --conf spark.executor.extraClassPath=path-to-jar -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/ERROR-Executor-java-lang-NoClassDefFoundError-tp24244p24256.html Sent from the Apache Spark User List mailing list archive at
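For example (placeholder paths; the jar must exist at that path on every node, since extraClassPath does not ship it for you):

    spark-submit \
      --conf spark.executor.extraClassPath=/path/to/dependency.jar \
      --conf spark.driver.extraClassPath=/path/to/dependency.jar \
      --class com.example.Main my-app.jar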

Re: Controlling number of executors on Mesos vs YARN

2015-08-13 Thread Ajay Singal
Tim, The ability to specify fine-grained configuration could be useful for many reasons. Let's take the example of a node with 32 cores. First of all, as per my understanding, having 5 executors each with 6 cores will almost always perform better than having a single executor with 30 cores.

Eviction of RDD persisted on disk

2015-08-13 Thread Eugene Morozov
Hello! I know there is an LRU eviction policy for in-memory cached partitions, but unfortunately I cannot find anything regarding the eviction policy for RDDs persisted on disk. Is there any? Does Spark assume that it has infinite disk available and never evicts anything? Even then, if it's

About Databricks's spark-sql-perf

2015-08-13 Thread Todd
Hi, I got a question about the spark-sql-perf project by Databricks at https://github.com/databricks/spark-sql-perf/ The Tables.scala (https://github.com/databricks/spark-sql-perf/blob/master/src/main/scala/com/databricks/spark/sql/perf/bigdata/Tables.scala) and BigData

Re: Driver staggering task launch times

2015-08-13 Thread Philip Weaver
Are you running on Mesos, YARN, or standalone? If you're on Mesos, are you using coarse-grained or fine-grained mode? On Thu, Aug 13, 2015 at 10:13 PM, Ara Vartanian arav...@cs.wisc.edu wrote: I'm observing an unusual situation where my step duration increases as I add further executors to my

RDD.join vs spark SQL join

2015-08-13 Thread Xiao JIANG
Hi, may I know the performance difference between the rdd.join function and the Spark SQL join operation? If I want to join several big RDDs, how should I decide which one I should use? What are the factors to consider here? Thanks!

Re: using Spark or pig group by efficient in my use case?

2015-08-13 Thread Eugene Morozov
I'd say Spark will be faster in this case, because it avoids storing intermediate data to disk after map and before reduce tasks. It'll be faster even if you use a Combiner (I'd assume Pig is able to figure that out). Hard to say how much faster, as it'll depend on the disks available (ssd vs sshd vs