YARN - Pyspark

2016-09-29 Thread ayan guha
Hi, I just observed a little weird behavior: I ran a pyspark job, a very simple one. conf = SparkConf() conf.setAppName("Historical Meter Load") conf.set("spark.yarn.queue","root.Applications") conf.set("spark.executor.instances","50") conf.set("spark.executor.memory","10g") conf.set("spark.yarn.ex

Re: spark listener do not get fail status

2016-09-29 Thread Aseem Bansal
Hi, in case my previous email was lacking in details, here are some more: - using Spark 2.0.0 - launching the job using org.apache.spark.launcher.SparkLauncher.startApplication(myListener) - checking state in the listener's stateChanged method On Thu, Sep 29, 2016 at 5:24 PM, Aseem Bansal
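
For reference, a minimal Scala sketch of attaching a listener through SparkLauncher (the jar path, main class, and master below are placeholders, and whether a FAILED state is ever reported also depends on how the driver exits, which is what this thread is chasing):

```scala
import org.apache.spark.launcher.{SparkAppHandle, SparkLauncher}

object LaunchWithListener {
  def main(args: Array[String]): Unit = {
    val listener = new SparkAppHandle.Listener {
      override def stateChanged(handle: SparkAppHandle): Unit = {
        val state = handle.getState
        println(s"state changed: $state (final = ${state.isFinal})")
        if (state == SparkAppHandle.State.FAILED) println("application failed")
      }
      override def infoChanged(handle: SparkAppHandle): Unit = ()
    }

    new SparkLauncher()
      .setAppResource("/path/to/my-app.jar")   // placeholder jar
      .setMainClass("com.example.MyApp")       // placeholder main class
      .setMaster("yarn")
      .startApplication(listener)
  }
}
```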

Re: Spark ML Decision Trees Algorithm

2016-09-29 Thread janardhan shetty
Hi, Any help here is appreciated .. On Wed, Sep 28, 2016 at 11:34 AM, janardhan shetty wrote: > Is there a reference to the research paper which is implemented in spark > 2.0 ? > > On Wed, Sep 28, 2016 at 9:52 AM, janardhan shetty > wrote: > >> Which algorithm is used under the covers while do

FetchFailed exception with Spark 1.6

2016-09-29 Thread Ankur Srivastava
Hi, I am running a simple job on Spark 1.6 in which I am trying to leftOuterJoin a big RDD with a smaller one. I am not broadcasting the smaller RDD yet, but I am still running into FetchFailed errors, and finally the job gets killed. I have already partitioned the data to 5000 partition
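
One common workaround when the small side fits in driver memory is to replace the shuffle join with a broadcast lookup; a rough sketch (the RDD element types are made up for illustration):

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Emulates bigRdd.leftOuterJoin(smallRdd) without a shuffle, assuming the
// small RDD can be collected to the driver and broadcast.
def broadcastLeftOuterJoin(sc: SparkContext,
                           bigRdd: RDD[(String, Int)],
                           smallRdd: RDD[(String, String)]): RDD[(String, (Int, Option[String]))] = {
  val smallMap = sc.broadcast(smallRdd.collectAsMap())
  bigRdd.mapPartitions { iter =>
    iter.map { case (k, v) => (k, (v, smallMap.value.get(k))) }
  }
}
```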

Issues in compiling spark 2.0.0 code using scala-maven-plugin

2016-09-29 Thread satyajit vegesna
Hi all, I am trying to compile code using Maven, which was working with Spark 1.6.2, but when I try Spark 2.0.0 I get the error below: org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute goal net.alchim31.maven:scala-maven-plugin:3.2.2:compile (default) on project Nginx

Re: Is there a way to get the AUC metric for CrossValidator?

2016-09-29 Thread Rich Tarro
According to the documentation, cvModel.avgMetrics gives average cross-validation metrics for each paramMap in CrossValidator.estimatorParamMaps, in the corresponding order. So when using areaUnderROC as the evaluator, cvModel.avgMetrics gives (in this example using scala, but API appears to work
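
A hedged Scala sketch of the pattern being described (the training DataFrame and the logistic regression estimator are assumptions for illustration only):

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

// `training` is assumed to be a DataFrame with "features" and "label" columns.
val lr = new LogisticRegression()
val evaluator = new BinaryClassificationEvaluator().setMetricName("areaUnderROC")
val paramGrid = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(0.01, 0.1))
  .build()

val cv = new CrossValidator()
  .setEstimator(lr)
  .setEvaluator(evaluator)
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(3)

val cvModel = cv.fit(training)

// avgMetrics(i) is the cross-validated AUC for estimatorParamMaps(i)
cvModel.getEstimatorParamMaps.zip(cvModel.avgMetrics).foreach { case (params, auc) =>
  println(s"AUC = $auc for $params")
}
```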

How to extract bestModel parameters from a CrossValidatorModel

2016-09-29 Thread Rich Tarro
I'm able to successfully extract parameters from a PipelineModel using model.stages. However, when I try to extract parameters from the bestModel of a CrossValidatorModel using cvModel.bestModel.stages, I get this error. error: value stages is not a member of org.apache.spark.ml.Model[_$4] cvMode
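
The usual fix is to cast bestModel back to its concrete type; a sketch assuming cvModel is the fitted CrossValidatorModel and the estimator was a Pipeline whose last stage was logistic regression (adjust the casts to whatever the pipeline actually contained):

```scala
import org.apache.spark.ml.PipelineModel
import org.apache.spark.ml.classification.LogisticRegressionModel

// bestModel is statically typed as Model[_]; if the estimator was a Pipeline,
// it is a PipelineModel at runtime, and the cast exposes .stages again.
val bestPipeline = cvModel.bestModel.asInstanceOf[PipelineModel]
bestPipeline.stages.foreach(stage => println(stage.extractParamMap()))

// Pull a specific stage back out, e.g. the final classifier
val lrModel = bestPipeline.stages.last.asInstanceOf[LogisticRegressionModel]
println(lrModel.getRegParam)
```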

Re: spark sql on json

2016-09-29 Thread Hyukjin Kwon
https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/sql/JavaSparkSQLExample.java#L104-L181 2016-09-29 18:58 GMT+09:00 Hitesh Goyal : > Hi team, > > > > I have a json document. I want to put spark SQL to it. > > Can you please send me an example app built i
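
A Scala sketch of the same flow as the linked Java example (the JSON path is a placeholder; each line of the file is expected to hold one JSON object):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("JsonSqlExample").getOrCreate()

// Infer the schema from the JSON file and expose it to SQL via a temp view
val df = spark.read.json("examples/src/main/resources/people.json")
df.printSchema()
df.createOrReplaceTempView("people")

spark.sql("SELECT name, age FROM people WHERE age > 21").show()
```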

Re: S3 DirectParquetOutputCommitter + PartitionBy + SaveMode.Append

2016-09-29 Thread Takeshi Yamamuro
Hi, FYI: Seems `sc.hadoopConfiguration.set("mapreduce.fileoutputcommitter.algorithm.version","2")` is only available at hadoop-2.7.3+. // maropu On Thu, Sep 29, 2016 at 9:28 PM, joffe.tal wrote: > You can use partition explicitly by adding "/=" > to > the end of the path you are writing to a

Re: Spark 2.0 issue

2016-09-29 Thread Xiao Li
Hi, Ashish, Will take a look at this soon. Thanks for reporting this, Xiao 2016-09-29 14:26 GMT-07:00 Ashish Shrowty : > If I try to inner-join two dataframes which originated from the same initial > dataframe that was loaded using spark.sql() call, it results in an error - > > // reading f

Running Spark master/slave instances in non Daemon mode

2016-09-29 Thread jpuro
Hi, I recently tried deploying Spark master and slave instances to container based environments such as Docker, Nomad etc. There are two issues that I've found with how the startup scripts work. The sbin/start-master.sh and sbin/start-slave.sh start a daemon by default, but this isn't as compatibl

Running Spark master/slave instances in non Daemon mode

2016-09-29 Thread Jeff Puro
Hi, I recently tried deploying Spark master and slave instances to container based environments such as Docker, Nomad etc. There are two issues that I've found with how the startup scripts work. The sbin/start-master.sh and sbin/start-slave.sh start a daemon by default, but this isn't as compatibl

Spark 2.0 issue

2016-09-29 Thread Ashish Shrowty
If I try to inner-join two dataframes which originated from the same initial dataframe that was loaded using spark.sql() call, it results in an error - // reading from Hive .. the data is stored in Parquet format in Amazon S3 val d1 = spark.sql("select * from ") val df1 = d1.groupBy("

Re: pyspark ML example not working

2016-09-29 Thread William Kupersanin
Was there an answer to this? I get this periodically when a job has died from an error and I run another job. I have gotten around it by going to /var/lib/hive/metastore/metastore_db and removing the *.lck files. I am sure this is the exact wrong thing to do as I imagine those lock files exist to p

Setting conf options in jupyter

2016-09-29 Thread William Kupersanin
Hello, I am trying to figure out how to correctly set config options in jupyter when I am already provided a SparkContext and a HiveContext. I need to increase a couple of memory allocations. My program dies indicating that I am trying to call methods on a stopped SparkContext. I thought I had cre

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Michael Segel
OP mentioned HBase or HDFS as persisted storage. Therefore they have to be running YARN if they are considering spark. (Assuming that you’re not trying to do a storage / compute model and use standalone spark outside your cluster. You can, but you have more moving parts…) I never said anythin

writing to s3 failing to move parquet files from temporary folder

2016-09-29 Thread jamborta
Hi, I have an 8 hour job (spark 2.0.0) that writes the results out to parquet using the standard approach: processed_images_df.write.format("parquet").save(s3_output_path) It executes 1 tasks and writes the results to a _temporary folder, and in the last step (after all the tasks completed)

Re: Questions about DataFrame's filter()

2016-09-29 Thread Michael Armbrust
-dev +user It surprises me as `filter()` takes a Column, not a `Row => Boolean`. There are several overloaded versions of Dataset.filter(...) def filter(func: FilterFunction[T]): Dataset[T] def filter(func: (T) ⇒ Boolean): Dataset[T] def filter(conditionExpr: String): Dataset[T] def filter(cond
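
For illustration, the overloads side by side on a small typed Dataset (the Person case class is made up):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

case class Person(name: String, age: Int)   // hypothetical schema

val spark = SparkSession.builder().appName("FilterOverloads").getOrCreate()
import spark.implicits._

val ds = Seq(Person("a", 20), Person("b", 30)).toDS()

ds.filter(col("age") > 21)             // filter(condition: Column)
ds.filter($"age" > 21)                 // same thing via the $ interpolator
ds.filter((p: Person) => p.age > 21)   // filter(func: T => Boolean)
ds.filter("age > 21")                  // filter(conditionExpr: String)
```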

Is there a way to get the AUC metric for CrossValidator?

2016-09-29 Thread evanzamir
I'm using CrossValidator (in PySpark) to create a logistic regression model. There is "areaUnderROC", which I assume gives the AUC for the bestModel chosen by CV. But how to get the areaUnderROC for the test data during the cross-validation?

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Cody Koeninger
The OP didn't say anything about Yarn, and why are you contemplating putting Kafka or Spark on public networks to begin with? Gwen's right, absent any actual requirements this is kind of pointless. On Thu, Sep 29, 2016 at 1:27 PM, Michael Segel wrote: > Spark standalone is not Yarn… or secure fo

Running in local mode as SQL engine - what to optimize?

2016-09-29 Thread RodrigoB
Hi all, For several reasons which I won't elaborate (yet), we're using Spark on local mode as an in memory SQL engine for data we're retrieving from Cassandra, execute SQL queries and return to the client - so no cluster, no worker nodes. I'm well aware local mode has always been considered a test

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Michael Segel
Spark standalone is not Yarn… or secure for that matter… ;-) > On Sep 29, 2016, at 11:18 AM, Cody Koeninger wrote: > > Spark streaming helps with aggregation because > > A. raw kafka consumers have no built in framework for shuffling > amongst nodes, short of writing into an intermediate topic

Metrics System not recognizing Custom Source/Sink in application jar

2016-09-29 Thread map reduced
Hi, I've added Custom Source and Sink in my application jar and found a way to get a static fixed metrics.properties on Stand-alone cluster nodes. When I want to launch my application, I give the static path - spark.metrics.conf="/fixed-path/to/metrics.properties". Despite my custom source/sink be

Re: Re: Selecting the top 100 records per group by?

2016-09-29 Thread Mariano Semelman
It's not Spark-specific, but it answers your question: https://blog.jooq.org/2014/08/12/the-difference-between-row_number-rank-and-dense_rank/ On 12 September 2016 at 12:42, Mich Talebzadeh wrote: > Hi, > > I don't understand why you need to add a column row_number when you can > use rank or den
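
In Spark itself the same idea looks roughly like this (the input DataFrame df and the column names are placeholders; swap row_number for rank or dense_rank depending on how ties should be treated):

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

// Keep the top 100 rows per "user", ordered by "score" descending
val byUser = Window.partitionBy("user").orderBy(col("score").desc)

val top100 = df
  .withColumn("rn", row_number().over(byUser))
  .filter(col("rn") <= 100)
  .drop("rn")
```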

Re: udf of aggregation in pyspark dataframe ?

2016-09-29 Thread peng yu
df: - a|b|c --- 1|m|n 1|x|j 2|m|x ... import pyspark.sql.functions as F from pyspark.sql.types import MapType, StringType def _my_zip(c, d): return dict(zip(c, d)) my_zip = F.udf(_my_zip, MapType(StringType(), StringType(), True), True) df.groupBy('a').agg(my_zip(collect_list

Pyspark - 1.5.0 pickle ML PipelineModel

2016-09-29 Thread Simone
Hi all, I am trying to save a trained ML pipeline model with pyspark 1.5. I am aware there is no .save method till 1.6 and that the workaround that should work is to serialize the PipelineModel object. This works in scala/java, but it seems like I cannot pickle the trained model in Python. Ha

Re: Question about executor memory setting

2016-09-29 Thread mohan s
Hi, kindly go through the link below. It explains spark memory allocations well. https://www.slideshare.net/cloudera/top-5-mistakes-to-avoid-when-writing-apache-spark-applications?from_m_app=ios Regards Mohan.s > On 28-Sep-2016, at 7:57 AM, Dogtail L wrote: > > Hi all, > > May I a

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Mich Talebzadeh
Hi Michael, How about druid here. Hive ORC tables are another option that have Streaming data ingest to Flume and storm However, Spark cannot read ORC transactional tables because of delta files, unless t

Re: Submit and Monitor standalone cluster application

2016-09-29 Thread Mariano Semelman
Sorry, my mistake (quick copy-paste): livy doesn't let me submit applications the classic way (with assembly jars) and forces me to change all my current applications. -- *Mariano Semelman* P13N - IT Av. Corrientes Nº 746 - piso 13 - C.A.B.A. (C1043AAU) Teléfono (54) 11

spark streaming minimum batch interval

2016-09-29 Thread Shushant Arora
Hi, I want to enquire whether spark streaming has a limitation of 500ms for the batch interval? Is Storm better than spark streaming for real time (for latency of just 50-100ms)? In spark streaming can parallel batches be run? If yes, is it supported at production level? Thanks

Re: Treadting NaN fields in Spark

2016-09-29 Thread Mich Talebzadeh
Thanks Michael. I realised that just checking for Volume > 0 should do val rs = df2.filter($"Volume".cast("Integer") > 0) will do, Your point on Again why not remove the rows where the volume of trades is 0? Are you referring to below scala> val rs = df2.filter($"Volume".cast("Integer") ===

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Cody Koeninger
> I still don't understand why writing to a transactional database with locking > and concurrency (read and writes) through JDBC will be fast for this sort of > data ingestion. Who cares about fast if your data is wrong? And it's still plenty fast enough https://youtu.be/NVl9_6J1G60?list=WL&t=

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Cody Koeninger
Spark streaming helps with aggregation because A. raw kafka consumers have no built in framework for shuffling amongst nodes, short of writing into an intermediate topic (I'm not touching Kafka Streams here, I don't have experience), and B. it deals with batches, so you can transactionally decide

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Mich Talebzadeh
The way I see this, there are two things involved. 1. Data ingestion through source to Kafka 2. Date conversion and Storage ETL/ELT 3. Presentation Item 2 is the one that needs to be designed correctly. I presume raw data has to conform to some form of MDM that requires schema mapping e

Fwd: tod...@yahoo-inc.com is no longer with Yahoo! (was: Re: Treadting NaN fields in Spark)

2016-09-29 Thread Michael Segel
Hi, Hate to be a pain… but could someone remove this email address (see below) from the spark mailing list(s) It seems that ‘Elvis’ has left the building and forgot to change his mail subscriptions… Begin forwarded message: From: Yahoo! No Reply mailto:postmas...@yahoo-inc.com>> Subject: tod..

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Michael Segel
Ok… so what’s the tricky part? Spark Streaming isn’t real time so if you don’t mind a slight delay in processing… it would work. The drawback is that you now have a long running Spark Job (assuming under YARN) and that could become a problem in terms of security and resources. (How well does Y

Re: Treadting NaN fields in Spark

2016-09-29 Thread Michael Segel
On Sep 29, 2016, at 10:29 AM, Mich Talebzadeh mailto:mich.talebza...@gmail.com>> wrote: Good points :) it took take "-" as a negative number -123456? Yeah… you have to go down a level and start to remember that you’re dealing with a stream or buffer of bytes below any casting. At this moment

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Ali Akhtar
The business use case is to read a user's data from a variety of different services through their API, and then allowing the user to query that data, on a per service basis, as well as an aggregate across all services. The way I'm considering doing it, is to do some basic ETL (drop all the unneces

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Cody Koeninger
No, direct stream in and of itself won't ensure an end-to-end guarantee, because it doesn't know anything about your output actions. You still need to do some work. The point is having easy access to offsets for batches on a per-partition basis makes it easier to do that work, especially in conju
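
A rough sketch of what "access to offsets per partition" looks like with the Kafka 0.8 direct stream (broker address, topic name, and the transactional write are placeholders):

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils}
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Assumes an existing SparkContext `sc`
val ssc = new StreamingContext(sc, Seconds(5))
val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")

val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("events"))

stream.foreachRDD { rdd =>
  // One OffsetRange per Kafka partition in this batch
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

  // ... write the batch results and the offsetRanges in the same transaction
  // (e.g. to Postgres) so output and consumed offsets commit or roll back together.
  offsetRanges.foreach { o =>
    println(s"${o.topic} ${o.partition} ${o.fromOffset} -> ${o.untilOffset}")
  }
}
```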

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Mich Talebzadeh
Hi Ali, What is the business use case for this? Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw * http://talebzadehmich.wordpress.com *Disclaim

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Deepak Sharma
If you use spark direct streams, it ensures an end-to-end guarantee for messages. On Thu, Sep 29, 2016 at 9:05 PM, Ali Akhtar wrote: > My concern with Postgres / Cassandra is only scalability. I will look > further into Postgres horizontal scaling, thanks. > > Writes could be idempotent if done as

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Mich Talebzadeh
Yes but still these writes from Spark have to go through JDBC? Correct. Having said that I don't see how doing this through Spark streaming to postgres is going to be faster than source -> Kafka - flume via zookeeper -> HDFS. I believe there is direct streaming from Kafka to Hive as well and fr

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Cody Koeninger
If you're doing any kind of pre-aggregation during ETL, spark direct stream will let you more easily get the delivery semantics you need, especially if you're using a transactional data store. If you're literally just copying individual uniquely keyed items from kafka to a key-value store, use kaf

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Ali Akhtar
My concern with Postgres / Cassandra is only scalability. I will look further into Postgres horizontal scaling, thanks. Writes could be idempotent if done as upserts, otherwise updates will be idempotent but not inserts. Data should not be lost. The system should be as fault tolerant as possible.

Re: Treadting NaN fields in Spark

2016-09-29 Thread Mich Talebzadeh
Good points :) it took take "-" as a negative number -123456? At this moment in time this is what the code does 1. csv is imported into HDFS as is. No cleaning done for rogue columns done at shell level 2. Spark programs does the following filtration: 3. val rs = df2.filter($"Open" !

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Cody Koeninger
I wouldn't give up the flexibility and maturity of a relational database, unless you have a very specific use case. I'm not trashing cassandra, I've used cassandra, but if all I know is that you're doing analytics, I wouldn't want to give up the ability to easily do ad-hoc aggregations without a l

Re: Treadting NaN fields in Spark

2016-09-29 Thread Peter Figliozzi
"isnan" ends up using a case class, subclass of UnaryExpression, called "IsNaN" which evaluates each row of the column like this: - *False* if the value is Null - Check the "Expression.Type" (apparently a Spark thing, not a Scala thing.. still learning here) - DoubleType: cast to Doub

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Deepak Sharma
Hi Cody Spark direct stream is just fine for this use case. But why postgres and not cassandra? Is there anything specific here that i may not be aware? Thanks Deepak On Thu, Sep 29, 2016 at 8:41 PM, Cody Koeninger wrote: > How are you going to handle etl failures? Do you care about lost / > d

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Cody Koeninger
How are you going to handle etl failures? Do you care about lost / duplicated data? Are your writes idempotent? Absent any other information about the problem, I'd stay away from cassandra/flume/hdfs/hbase/whatever, and use a spark direct stream feeding postgres. On Thu, Sep 29, 2016 at 10:04 A

Re: udf of aggregation in pyspark dataframe ?

2016-09-29 Thread peng yu
btw, i am using spark 1.6.1

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Ali Akhtar
Is there an advantage to that vs directly consuming from Kafka? Nothing is being done to the data except some light ETL and then storing it in Cassandra On Thu, Sep 29, 2016 at 7:58 PM, Deepak Sharma wrote: > Its better you use spark's direct stream to ingest from kafka. > > On Thu, Sep 29, 2016

udf of aggregation in pyspark dataframe ?

2016-09-29 Thread peng yu
Hi, is there a way to write a udf in pyspark that supports agg()? I searched all over the docs and the internet, and tested it out: some say yes, some say no. And when I try those "yes" code examples, they just complain about AnalysisException: u"expression 'pythonUDF' is neither present in the group by, nor

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Deepak Sharma
Its better you use spark's direct stream to ingest from kafka. On Thu, Sep 29, 2016 at 8:24 PM, Ali Akhtar wrote: > I don't think I need a different speed storage and batch storage. Just > taking in raw data from Kafka, standardizing, and storing it somewhere > where the web UI can query it, see

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Ali Akhtar
I don't think I need a different speed storage and batch storage. Just taking in raw data from Kafka, standardizing, and storing it somewhere where the web UI can query it, seems like it will be enough. I'm thinking about: - Reading data from Kafka via Spark Streaming - Standardizing, then storin

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Deepak Sharma
Since the inflow is huge, flume would also need to be run with multiple channels in distributed fashion. In that case, the resource utilization will be high as well. Thanks Deepak On Thu, Sep 29, 2016 at 8:11 PM, Mich Talebzadeh wrote: > - Spark Streaming to read data from Kafka

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Mich Talebzadeh
- Spark Streaming to read data from Kafka - Storing the data on HDFS using Flume You don't need Spark streaming to read data from Kafka and store on HDFS. It is a waste of resources. Couple Flume to use Kafka as source and HDFS as sink directly KafkaAgent.sources = kafka-sources KafkaAgent.sinks

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Deepak Sharma
For ui , you need DB such as Cassandra that is designed to work around queries . Ingest the data to spark streaming (speed layer) and write to hdfs(for batch layer). Now you have data at rest as well as in motion(real time). >From spark streaming itself , do further processing and write the final r

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Alonso Isidoro Roman
"Using Spark to query the data in the backend of the web UI?" Dont do that. I would recommend that spark streaming process stores data into some nosql or sql database and the web ui to query data from that database. Alonso Isidoro Roman [image: https://]about.me/alonso.isidoro.roman

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Ali Akhtar
The web UI is actually the speed layer, it needs to be able to query the data online, and show the results in real-time. It also needs a custom front-end, so a system like Tableau can't be used, it must have a custom backend + front-end. Thanks for the recommendation of Flume. Do you think this w

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Mich Talebzadeh
You need a batch layer and a speed layer. Data from Kafka can be stored on HDFS using flume. - Query this data to generate reports / analytics (There will be a web UI which will be the front-end to the data, and will show the reports) This is basically batch layer and you need something like Tab

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Ali Akhtar
It needs to be able to scale to a very large amount of data, yes. On Thu, Sep 29, 2016 at 7:00 PM, Deepak Sharma wrote: > What is the message inflow ? > If it's really high , definitely spark will be of great use . > > Thanks > Deepak > > On Sep 29, 2016 19:24, "Ali Akhtar" wrote: > >> I have a

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Deepak Sharma
What is the message inflow ? If it's really high , definitely spark will be of great use . Thanks Deepak On Sep 29, 2016 19:24, "Ali Akhtar" wrote: > I have a somewhat tricky use case, and I'm looking for ideas. > > I have 5-6 Kafka producers, reading various APIs, and writing their raw > data

configure spark with openblas, thanks

2016-09-29 Thread TheGeorge1918 .
Hi all, I’m trying to properly configure OpenBlas in spark ml. I use centos7, hadoop2.7.2, spark2.0 and python2.7. (I use pyspark to build ml pipeline) At first I have following warnings *WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS* *WARN BLAS: Fa

Architecture recommendations for a tricky use case

2016-09-29 Thread Ali Akhtar
I have a somewhat tricky use case, and I'm looking for ideas. I have 5-6 Kafka producers, reading various APIs, and writing their raw data into Kafka. I need to: - Do ETL on the data, and standardize it. - Store the standardized data somewhere (HBase / Cassandra / Raw HDFS / ElasticSearch / Pos

RE: building runnable distribution from source

2016-09-29 Thread Mendelson, Assaf
Thanks, that solved it. If there is a developer here, it would be useful if this error would be marked as error instead of INFO (especially since this causes core to fail instead of an R package). Thanks, Assaf. -Original Message- From: Ding Fei [mailto:ding...@stars.org.cn] Sen

mapWithState() without data checkpointing

2016-09-29 Thread Alexey Kharlamov
Hello! I would like to avoid data checkpointing when processing a DStream. Basically, we do not care if the intermediate data are lost. Is there a way to achieve that? Is there an extension point or class embedding all associated activities? Thanks! Sincerely yours, — Alexey Kharlamov --

Re: Treadting NaN fields in Spark

2016-09-29 Thread Michael Segel
Hi, Just a few thoughts so take it for what its worth… Databases have static schemas and will reject a row’s column on insert. In your case… you have one data set where you have a column which is supposed to be a number but you have it as a string. You want to convert this to a double in your f

Re: S3 DirectParquetOutputCommitter + PartitionBy + SaveMode.Append

2016-09-29 Thread joffe.tal
You can use partition explicitly by adding "/=" to the end of the path you are writing to and then use overwrite. BTW in Spark 2.0 you just need to use: sc.hadoopConfiguration.set("mapreduce.fileoutputcommitter.algorithm.version","2") and use s3a:// and you can work with regular output committer
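
A hedged Spark 2.0 sketch of that combination (the DataFrame df, bucket, prefix, and partition column are placeholders, and the v2 committer still depends on the Hadoop version available on the cluster, as noted later in this thread):

```scala
// Use the "version 2" file output committer to avoid the slow serial rename step
spark.sparkContext.hadoopConfiguration
  .set("mapreduce.fileoutputcommitter.algorithm.version", "2")

df.write
  .mode("append")
  .partitionBy("date")                 // hypothetical partition column
  .parquet("s3a://my-bucket/events/")  // s3a:// rather than s3n://
```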

Re: Spark Hive Rejection

2016-09-29 Thread Michael Segel
Correct me if I’m wrong but isn’t hive schema on read and not on write? So you shouldn’t fail on write. On Sep 29, 2016, at 1:25 AM, Mostafa Alaa Mohamed mailto:mohamedamost...@etisalat.ae>> wrote: Dears, I want to ask • What will happen if there are rejected rows when inserting da

Re: building runnable distribution from source

2016-09-29 Thread Michael Segel
You may want to replace the 2.4 with a later release. On Sep 29, 2016, at 3:08 AM, AssafMendelson mailto:assaf.mendel...@rsa.com>> wrote: Hi, I am trying to compile the latest branch of spark in order to try out some code I wanted to contribute. I was looking at the instructions to build from

spark listener do not get fail status

2016-09-29 Thread Aseem Bansal
Hi, I am submitting a job via the spark api but I never get a fail status, even when the job throws an exception or exits via System.exit(-1). How do I indicate via the SparkListener API that my job failed?

Re: Large-scale matrix inverse in Spark

2016-09-29 Thread Robineast
The paper you mention references a Spark-based LU decomposition approach. AFAIK there is no current implementation in Spark but there is a JIRA open (https://issues.apache.org/jira/browse/SPARK-8514 ) that covers this - seems to have gone quiet

Re: building runnable distribution from source

2016-09-29 Thread Ding Fei
Check that your R is properly installed: >Cannot find 'R_HOME'. Please specify 'R_HOME' or make sure R is properly >installed. On Thu, 2016-09-29 at 01:08 -0700, AssafMendelson wrote: > Hi, > > I am trying to compile the latest branch of spark in order to try out > some code I wanted to contr

spark sql on json

2016-09-29 Thread Hitesh Goyal
Hi team, I have a json document and I want to run Spark SQL on it. Can you please send me an example app built in Java so that I can run spark sql queries on my data? Regards, Hitesh Goyal Simpli5d Technologies Cont No.: 9996588220

Re: Using Spark as a Maven dependency but with Hadoop 2.6

2016-09-29 Thread Sean Owen
No, I think that's what dependencyManagement (or equivalent) is definitely for. On Thu, Sep 29, 2016 at 5:37 AM, Olivier Girardot wrote: > I know that the code itself would not be the same, but it would be useful to > at least have the pom/build.sbt transitive dependencies different when > fetching

Re: Using Spark as a Maven dependency but with Hadoop 2.6

2016-09-29 Thread Olivier Girardot
I know that the code itself would not be the same, but it would be useful to at least have the pom/build.sbt transitive dependencies different when fetching the artifact with a specific classifier, don't you think? For now I've overridden them myself using the dependency versions defined in the pom.

Re: spark persistence doubt

2016-09-29 Thread Bedrytski Aliaksandr
Hi, the 4th step should contain "transformrdd2", right? Considering that transformations are lined up and executed only when there is an action (also known as lazy execution), I would say that adding persist() to step 1 would not do any good (and may even be harmful as you may lose the optimi
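
A generic sketch of the lazy-execution point (paths and transformations are made up, not the ones from the original email): persist() pays off on the RDD that more than one action reuses, not necessarily on the first step of the chain.

```scala
// Assumes an existing SparkContext `sc`
val raw = sc.textFile("hdfs:///data/input")                 // placeholder path
val transformed = raw.map(_.toUpperCase).filter(_.nonEmpty)

transformed.persist()                  // cached when first materialized
val total = transformed.count()        // action 1: runs the whole lineage
val sample = transformed.take(10)      // action 2: served from the cache
```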

building runnable distribution from source

2016-09-29 Thread AssafMendelson
Hi, I am trying to compile the latest branch of spark in order to try out some code I wanted to contribute. I was looking at the instructions to build from http://spark.apache.org/docs/latest/building-spark.html So at first I did: ./build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTest