Re: java.io.IOException: java.lang.reflect.InvocationTargetException on new spark machines

2016-02-26 Thread Abhishek Anand
Any insights on this? On Fri, Feb 26, 2016 at 1:21 PM, Abhishek Anand wrote: > On changing the default compression codec, which is snappy, to lzf the > errors are gone!! > > How can I fix this using snappy as the codec? > > Is there any downside of using lzf as snappy

.cache() changes contents of RDD

2016-02-26 Thread Yan Yang
Hi, I am pretty new to Spark, and after experimenting with our pipelines I ran into this weird issue. The Scala code is as below: val input = sc.newAPIHadoopRDD(...) val rdd = input.map(...) rdd.cache() rdd.saveAsTextFile(...) I found rdd to consist of 80+K identical rows. To be more precise,
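A likely cause (an assumption, not confirmed in the thread) is Hadoop's record readers reusing the same Writable object for every record, so cache() ends up storing references to one mutated object; a minimal sketch of the usual workaround is to copy values out of the Writables before caching:

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

    // Hypothetical path and input format; assumes an existing SparkContext `sc`.
    val input = sc.newAPIHadoopFile("hdfs:///data/input",
      classOf[TextInputFormat], classOf[LongWritable], classOf[Text])
    // Copy the reused Writables into plain Scala values before caching.
    val rdd = input.map { case (k, v) => (k.get, v.toString) }
    rdd.cache()
    rdd.saveAsTextFile("hdfs:///data/output")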

RE: Standalone vs. Mesos for production installation on a smallish cluster

2016-02-26 Thread Mohammed Guller
I think you may be referring to Spark Survey 2015. According to that survey, 48% use standalone, 40% use YARN and only 11% use Mesos (the numbers don’t add up to 100 – probably because of rounding error). Mohammed Author: Big Data Analytics with

RE: Get all vertexes with outDegree equals to 0 with GraphX

2016-02-26 Thread Mohammed Guller
Here is another solution (minGraph is the graph from your code. I assume that is your original graph): val graphWithNoOutEdges = minGraph.filter( graph => graph.outerJoinVertices(graph.outDegrees) {(vId, vData, outDegreesOpt) => outDegreesOpt.getOrElse(0)}, vpred = (vId: VertexId,
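A self-contained sketch of the same idea (toy graph and attribute types are illustrative): outerJoinVertices attaches each vertex's out-degree, defaulting to 0 for vertices absent from outDegrees, and a filter keeps the zero-degree ones.

    import org.apache.spark.graphx.{Edge, Graph, VertexId}
    import org.apache.spark.rdd.RDD

    // Toy graph: 1 -> 2, 2 -> 3, so only vertex 3 has no out edges.
    val vertices: RDD[(VertexId, String)] =
      sc.parallelize(Seq((1L, "a"), (2L, "b"), (3L, "c")))
    val edges: RDD[Edge[Int]] =
      sc.parallelize(Seq(Edge(1L, 2L, 1), Edge(2L, 3L, 1)))
    val graph = Graph(vertices, edges)

    val sinks = graph
      .outerJoinVertices(graph.outDegrees) { (_, attr, degOpt) => (attr, degOpt.getOrElse(0)) }
      .vertices
      .filter { case (_, (_, deg)) => deg == 0 }

    sinks.collect.foreach(println)  // (3,(c,0))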

Configure Spark Resource on AWS CLI Not Working

2016-02-26 Thread Weiwei Zhang
Hi there, I am trying to configure memory for spark using AWS CLI. However, I got the following message: *A client error (ValidationException) occurred when calling the RunJobFlow operation: Cannot specify args for application 'Spark' when release label is used.* In the aws 'create-cluster'

Re: PySpark : couldn't pickle object of type class T

2016-02-26 Thread Anoop Shiralige
Hi Jeff, Thank you for looking into the post. I had explored the spark-avro option earlier. Since we have a union of multiple complex data types in our avro schema, we couldn't use it. Couple of things I tried. -

RE: Clarification on RDD

2016-02-26 Thread Mohammed Guller
HDFS, as the name implies, is a distributed file system. A file stored on HDFS is already distributed. So if you create an RDD from an HDFS file, the created RDD just points to the file partitions on different nodes. You can read more about HDFS here.
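For example, a minimal sketch (hypothetical namenode address and path):

    // Creating the RDD moves no data; its partitions simply reference the
    // HDFS blocks, which are read only when an action such as count() runs.
    val textFile = sc.textFile("hdfs://namenode:8020/user/data/README.md")
    println(textFile.partitions.length)  // roughly one partition per HDFS block
    println(textFile.count())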

Re: s3 access through proxy

2016-02-26 Thread Gourav Sengupta
Hi, why are you trying to access data in S3 via another network? Does that not cause huge network overhead, data transmission losses (as data is getting transferred over the internet) and other inconsistencies? Have you tried using the AWS CLI? Using the "aws s3 sync" command you can copy all the files

SparkML Using Pipeline API locally on driver

2016-02-26 Thread Eugene Morozov
Hi everyone. I have a requirement to run prediction for a random forest model locally on a web service, without touching Spark at all in some specific cases. I've achieved that with the previous mllib API (Java 8 syntax): public List> predictLocally(RandomForestModel model,
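A minimal sketch of the mllib approach being described (Scala rather than the Java 8 original; assumes the RandomForestModel is already loaded in the web service's JVM, since predict(Vector) needs no SparkContext at call time):

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.tree.model.RandomForestModel

    // Hypothetical helper: score feature arrays coming from web requests.
    def predictLocally(model: RandomForestModel, rows: List[Array[Double]]): List[Double] =
      rows.map(features => model.predict(Vectors.dense(features)))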

Re: Spark SQL support for sub-queries

2016-02-26 Thread Mich Talebzadeh
Ok let us try this val d = HiveContext.table("test.dummy") d.registerTempTable("tmp") //Obtain boundary values var minValue : Int = HiveContext.sql("SELECT minRow.id AS minValue FROM (SELECT min(struct(id)) as minRow FROM tmp) AS a").collect.apply(0).getInt(0) var maxValue : Int =

Re: Spark SQL support for sub-queries

2016-02-26 Thread Yin Yang
I tried the following: scala> Seq((2, "a", "test"), (2, "b", "foo")).toDF("id", "a", "b").registerTempTable("test") scala> val df = sql("SELECT maxRow.* FROM (SELECT max(struct(id, b, a)) as maxRow FROM test) a") df: org.apache.spark.sql.DataFrame = [id: int, b: string ... 1 more field] scala>

Re: Spark SQL support for sub-queries

2016-02-26 Thread Michael Armbrust
There will probably be some subquery support in 2.0. That particular query would be more efficient to express as an argmax however. Here is an example in Spark 1.6

Attempting to aggregate multiple values

2016-02-26 Thread Daniel Imberman
Hi all, So over the past few days I've been attempting to create a function that takes an RDD[U] and creates three MMaps. I've been attempting to aggregate these values but I'm running into a major issue. When I initially tried to use separate aggregators for each map, I noticed a significant

Re: DirectFileOutputCommiter

2016-02-26 Thread Alexander Pivovarov
DirectOutputCommitter doc says: The FileOutputCommitter is required for HDFS + speculation, which allows only one writer at a time for a file (so two people racing to write the same file would not work). However, S3 supports multiple writers outputting to the same file, where visibility is

Re: d.filter("id in max(id)")

2016-02-26 Thread Michael Armbrust
You can do max on a struct to get the max value for the first column, along with the values for other columns in the row (an argmax) Here is an example
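A sketch of that argmax pattern (toy data; the max of a struct is ordered by its first field, so the remaining fields come along with the winning row):

    // Assumes a SQLContext named sqlContext, as in the spark-shell.
    import sqlContext.implicits._

    val df = Seq((1, "a", "x"), (2, "b", "y"), (3, "c", "z")).toDF("id", "b", "a")
    df.registerTempTable("test")
    val argmax = sqlContext.sql(
      "SELECT maxRow.* FROM (SELECT max(struct(id, b, a)) AS maxRow FROM test) t")
    argmax.show()  // id=3, b=c, a=z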

Re: Spark SQL support for sub-queries

2016-02-26 Thread Mich Talebzadeh
Good stuff. I decided to do some boundary value analysis by getting records where the ID (unique value) is IN (min() and max()). Unfortunately Hive SQL does not yet support more than one level of sub-query. For example this operation is perfectly valid in Oracle: select * from dummy where id IN

Re: DirectFileOutputCommiter

2016-02-26 Thread Reynold Xin
It could lose data in speculation mode, or if any job fails. On Fri, Feb 26, 2016 at 3:45 AM, Igor Berman wrote: > Takeshi, do you know the reason why they wanted to remove this committer in > SPARK-10063? > the jira has no info inside > as far as I understand the direct

s3 access through proxy

2016-02-26 Thread Joshua Buss
Hello, I'm trying to use spark with google cloud storage, but from a network where I cannot talk to the outside internet directly. This means we have a proxy set up for all requests heading towards GCS. So far, I've had good luck with projects that talk to S3 through libraries like boto (for

Re: Spark 1.5 on Mesos

2016-02-26 Thread Tim Chen
https://spark.apache.org/docs/latest/running-on-mesos.html should be the best source, what problems were you running into? Tim On Fri, Feb 26, 2016 at 11:06 AM, Yin Yang wrote: > Have you read this ? > https://spark.apache.org/docs/latest/running-on-mesos.html > > On Fri,

Spark 1.5 on Mesos

2016-02-26 Thread Ashish Soni
Hi All, Is there any proper documentation on how to run Spark on Mesos? I have been trying for the last few days and am not able to make it work. Please help. Ashish

Re: Spark 1.5 on Mesos

2016-02-26 Thread Yin Yang
Have you read this ? https://spark.apache.org/docs/latest/running-on-mesos.html On Fri, Feb 26, 2016 at 11:03 AM, Ashish Soni wrote: > Hi All , > > Is there any proper documentation as how to run spark on mesos , I am > trying from the last few days and not able to make

Re: Standalone vs. Mesos for production installation on a smallish cluster

2016-02-26 Thread Tim Chen
Mesos does provide some benefits and features, such as the ability to launch all the Spark pieces in Docker and also Mesos resource scheduling features (weights, roles), and if you plan to also use HDFS/Cassandra there are existing frameworks that are actively maintained by us. That said when

Re: Saving and Loading Dataframes

2016-02-26 Thread Raj Kumar
Thanks for the response Yanbo. Here is the source (it uses the sample_libsvm_data.txt file used in the mllib examples). -Raj — IOTest.scala — import org.apache.spark.{SparkConf,SparkContext} import org.apache.spark.sql.SQLContext import org.apache.spark.sql.DataFrame object

Clarification on RDD

2016-02-26 Thread Ashok Kumar
Hi, Spark doco says Spark’s primary abstraction is a distributed collection of items called a Resilient Distributed Dataset (RDD). RDDs can be created from Hadoop InputFormats (such as HDFS files) or by transforming other RDDs example: val textFile = sc.textFile("README.md") my question is

Re: How does Spark streaming's Kafka direct stream survive from worker node failure?

2016-02-26 Thread yuhang.chenn
Thanks a lot. Sent from WPS Mail. On Feb 27, 2016, at 1:02 AM, Cody Koeninger wrote: Yes. On Thu, Feb 25, 2016 at 9:45 PM, yuhang.chenn wrote: Thanks a lot. And I got another question: What would happen if I didn't set "spark.streaming.kafka.maxRatePerPartition"? Will Spark

Re: Hbase in spark

2016-02-26 Thread Ted Yu
I know little about your use case. Did you mean that your data is relatively evenly distributed in Spark domain but showed skew in the bulk load phase ? On Fri, Feb 26, 2016 at 9:02 AM, Renu Yadav wrote: > Hi Ted, > > Thanks for the reply. I am using spark hbase module only

Re: kafka streaming topic partitions vs executors

2016-02-26 Thread Cody Koeninger
Spark in general isn't a good fit if you're trying to make sure that certain tasks only run on certain executors. You can look at overriding getPreferredLocations and increasing the value of spark.locality.wait, but even then, what do you do when an executor fails? On Fri, Feb 26, 2016 at 8:08

Mllib Logistic Regression performance relative to Mahout

2016-02-26 Thread raj.kumar
Hi, We are trying to port over some code that uses Mahout Logistic Regression to Mllib Logistic Regression and our preliminary performance tests indicate a performance bottleneck. It is not clear to me if this is due to one of three factors: o Comparing apples to oranges o Inadequate tuning o

Re: How does Spark streaming's Kafka direct stream survive from worker node failure?

2016-02-26 Thread Cody Koeninger
Yes. On Thu, Feb 25, 2016 at 9:45 PM, yuhang.chenn wrote: > Thanks a lot. > And I got another question: What would happen if I didn't set > "spark.streaming.kafka.maxRatePerPartition"? Will Spark Streaming try to > consume all the messages in Kafka? > > Sent from WPS Mail > On
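For reference, a sketch of capping the ingest rate (illustrative app name and value; without the setting, the first batch of a direct stream will try to pull everything currently in Kafka):

    import org.apache.spark.SparkConf

    // Caps the read rate at 1000 records per second per Kafka partition.
    val conf = new SparkConf()
      .setAppName("kafka-direct-example")
      .set("spark.streaming.kafka.maxRatePerPartition", "1000")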

Re: Hbase in spark

2016-02-26 Thread Ted Yu
In hbase, there is hbase-spark module which supports bulk load. This module is to be backported in the upcoming 1.3.0 release. There is some pending work, such as HBASE-15271 . FYI On Fri, Feb 26, 2016 at 8:50 AM, Renu Yadav wrote: > Has anybody implemented bulk load into

Hbase in spark

2016-02-26 Thread Renu Yadav
Has anybody implemented bulk load into hbase using spark? I need help to optimize its performance. Please help. Thanks & Regards, Renu Yadav

Re: Spark SQL support for sub-queries

2016-02-26 Thread Yin Yang
Since collect is involved, the approach would be slower compared to the SQL Mich gave in his first email. On Fri, Feb 26, 2016 at 1:42 AM, Michał Zieliński < zielinski.mich...@gmail.com> wrote: > You need to collect the value. > > val m: Int = d.agg(max($"id")).collect.apply(0).getInt(0) >

Re: Use maxmind geoip lib to process ip on Spark/Spark Streaming

2016-02-26 Thread Romain Sagean
It seems like some libraries are missing. I'm not good at compiling and I don't know how to use Gradle. But for sbt I use the sbt-assembly plugin ( https://github.com/sbt/sbt-assembly) to include all dependencies and make a fat jar. For Gradle I have found this:
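A minimal sketch of that sbt-assembly setup (versions are illustrative; marking Spark as provided keeps it out of the fat jar, while the GeoIP dependency is bundled):

    // project/plugins.sbt
    addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.1")

    // build.sbt -- then run `sbt assembly` to produce the fat jar
    name := "geoip-streaming-job"
    scalaVersion := "2.10.6"
    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core" % "1.6.0" % "provided",
      "com.maxmind.geoip2" % "geoip2" % "2.5.0"
    )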

kafka streaming topic partitions vs executors

2016-02-26 Thread patcharee
Hi, I am working on a streaming application integrated with Kafka via the createDirectStream API. The application streams a topic which contains 10 partitions (on Kafka). It executes with 10 workers (--num-executors 10). When it reads data from Kafka/ZooKeeper, Spark creates 10 tasks (the same as

Re: Get all vertexes with outDegree equals to 0 with GraphX

2016-02-26 Thread Guillermo Ortiz
Yes, I am not really happy with that "collect". I was looking at using the subgraph method and other options and didn't figure out anything easy or direct. I'm going to try your idea. 2016-02-26 14:16 GMT+01:00 Robin East : > Whilst I can think of other ways to do it I

Java/Spark Library for interacting with Spark API

2016-02-26 Thread Hans van den Bogert
Hi, Does anyone know of a Java/Scala library (not simply an HTTP library) for interacting with Spark through its REST/HTTP API? My “problem” is that interacting through REST induces a lot of work mapping the JSON to sensible Spark/Scala objects. So as a simple example, I hope there is a library

Re: Dynamic allocation Spark

2016-02-26 Thread Alvaro Brandon
That was exactly it. I had the worker and master processes of Spark standalone running together with YARN and somehow the resource manager didn't see the nodes. It's working now. Thanks for the tip :-) 2016-02-26 12:33 GMT+01:00 Jeff Zhang : > Check the RM UI to ensure you

Re: Get all vertexes with outDegree equals to 0 with GraphX

2016-02-26 Thread Robin East
Whilst I can think of other ways to do it I don’t think they would be conceptually or syntactically any simpler. GraphX doesn’t have the concept of built-in vertex properties which would make this simpler - a vertex in GraphX is a Vertex ID (Long) and a bunch of custom attributes that you

Get all vertexes with outDegree equals to 0 with GraphX

2016-02-26 Thread Guillermo Ortiz
I'm new to GraphX. I need to get the vertices without out edges. I guess that it's pretty easy but I did it in a pretty complicated and inefficient way: val vertices: RDD[(VertexId, (List[String], List[String]))] = sc.parallelize(Array((1L, (List("a"), List[String]())), (2L, (List("b"),

Re: Bug in DiskBlockManager subDirs logic?

2016-02-26 Thread Igor Berman
I've experienced this kind of output when an executor was killed (e.g. by the OOM killer) or was lost for some reason, i.e. try to look at the machine to see whether the executor was restarted... On 26 February 2016 at 08:37, Takeshi Yamamuro wrote: > Hi, > > Could you make simple codes to

Re: Standalone vs. Mesos for production installation on a smallish cluster

2016-02-26 Thread Igor Berman
IMHO most production clusters are standalone. There was a presentation from Spark Summit with some stats inside (can't find it right now), and standalone was in 1st place. It was from Matei: https://databricks.com/resources/slides On 26 February 2016 at 13:40, Petr Novak

Re: DirectFileOutputCommiter

2016-02-26 Thread Igor Berman
Takeshi, do you know the reason why they wanted to remove this committer in SPARK-10063? The jira has no info inside. As far as I understand, the direct committer can't be used when either of the two is true: 1. speculation mode 2. append mode (i.e. not creating a new version of data but appending to existing

Re: How does Spark streaming's Kafka direct stream survive from worker node failure?

2016-02-26 Thread yuhang.chenn
Thanks a lot. And I got another question: What would happen if I didn't set "spark.streaming.kafka.maxRatePerPartition"? Will Spark Streaming try to consume all the messages in Kafka? Sent from WPS Mail. On Feb 25, 2016, at 11:58 AM, Cody Koeninger wrote: The per partition offsets are part of the

Re: DirectFileOutputCommiter

2016-02-26 Thread Igor Berman
Alexander, the implementation you've attached supports both modes, configured by the property "mapred.output.direct." + fs.getClass().getSimpleName(). Since you see the _temporary dir, the mode is probably off, i.e. the default impl is working and you are experiencing some other problem. On 26 February 2016 at

Standalone vs. Mesos for production installation on a smallish cluster

2016-02-26 Thread Petr Novak
Hi all, I believe the documentation used to say that Standalone mode is not for production. I'm either wrong or it was already removed. For a small cluster of between 5-10 nodes, is Standalone recommended for production? I would like to go with Mesos but the question is whether there is real

Re: DirectFileOutputCommiter

2016-02-26 Thread Igor Berman
The performance gain is in the commit stage, when data is moved from the _temporary directory to the destination directory; since S3 is really a key-value store, the move operation is like a copy operation. On 26 February 2016 at 08:24, Takeshi Yamamuro wrote: > Hi, > > Great work! > What is the

Re: Dynamic allocation Spark

2016-02-26 Thread Jeff Zhang
Check the RM UI to ensure you have available resources. I suspect it might be that you didn't configure yarn correctly, so NM didn't start properly and you have no resource. On Fri, Feb 26, 2016 at 7:14 PM, alvarobrandon wrote: > Hello everyone: > > I'm trying the

Dynamic allocation Spark

2016-02-26 Thread alvarobrandon
Hello everyone: I'm trying the dynamic allocation in Spark with YARN. I have followed the following configuration steps: 1. Copy the spark-*-yarn-shuffle.jar to the nodemanager classpath. "cp /opt/spark/lib/spark-*-yarn-shuffle.jar /opt/hadoop/share/hadoop/yarn" 2. Added the shuffle service of
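For reference, a sketch of the related configuration (property names as in the Spark and YARN docs; values and paths are illustrative):

    <!-- yarn-site.xml: register Spark's external shuffle service on each NodeManager -->
    <property>
      <name>yarn.nodemanager.aux-services</name>
      <value>mapreduce_shuffle,spark_shuffle</value>
    </property>
    <property>
      <name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
      <value>org.apache.spark.network.yarn.YarnShuffleService</value>
    </property>

    # spark-defaults.conf
    spark.dynamicAllocation.enabled  true
    spark.shuffle.service.enabled    true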

Re: No event log in /tmp/spark-events

2016-02-26 Thread Jeff Zhang
If event log is enabled, there should be log like following. But I don't see it in your log. 16/02/26 19:10:01 INFO EventLoggingListener: Logging events to file:///Users/jzhang/Temp/spark-events/local-1456485001491 Could you add "--verbose" in spark-submit command to check whether your
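For reference, a sketch of the settings that produce that log line (the directory is illustrative and must already exist):

    # spark-defaults.conf
    spark.eventLog.enabled  true
    spark.eventLog.dir      file:///tmp/spark-events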

Re: No event log in /tmp/spark-events

2016-02-26 Thread alvarobrandon
Just write /tmp/sparkserverlog without the file part.

Re: DirectFileOutputCommiter

2016-02-26 Thread Teng Qiu
Hi, thanks :) performance gain is huge, we have a INSERT INTO query, ca. 30GB in JSON format will be written to s3 at the end, without DirectOutputCommitter and our hack in hive and InsertIntoHiveTable.scala, it took more than 40min, with our changes, only 15min then. DirectOutputCommitter works

Is spark.driver.maxResultSize used correctly ?

2016-02-26 Thread Jeff Zhang
My job gets this exception very easily even when I set a large value of spark.driver.maxResultSize. After checking the Spark code, I found spark.driver.maxResultSize is also used on the Executor side to decide whether a DirectTaskResult or an IndirectTaskResult is sent. This doesn't make sense to me. Using

Re: Survival Curves using AFT implementation in Spark

2016-02-26 Thread Yanbo Liang
Hi Stuti, AFTSurvivalRegression does not support computing the predicted survival functions/curves currently. I don't know whether the quantile predictions can help you; you can refer to the example
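A minimal sketch of those quantile predictions (spark.ml API; the training DataFrame, probabilities and column names are illustrative and assume the usual label/censor/features columns):

    import org.apache.spark.ml.regression.AFTSurvivalRegression

    val aft = new AFTSurvivalRegression()
      .setQuantileProbabilities(Array(0.25, 0.5, 0.75))
      .setQuantilesCol("quantiles")
    val model = aft.fit(training)  // `training`: DataFrame with label, censor, features
    model.transform(training).select("label", "prediction", "quantiles").show(false)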

Re: Restricting number of cores not resulting in reduction in parallelism

2016-02-26 Thread ankursaxena86
The issue is resolved now. I figured out that I wasn't aware the Spark master parameter was hard-coded as local[4] in the program code, which was causing the parallel execution despite me trying to limit cores and executors from the command-line options. It's a revelation for me that program

Re: Spark SQL support for sub-queries

2016-02-26 Thread Mich Talebzadeh
Thanks, much appreciated. On 26 February 2016 at 09:54, Michał Zieliński wrote: > Spark has great documentation > and guides

Re: Spark SQL support for sub-queries

2016-02-26 Thread Michał Zieliński
Spark has great documentation and guides: lit and col are here; getInt is

Re: Spark SQL support for sub-queries

2016-02-26 Thread Mich Talebzadeh
Thanks Michael. Great: d.filter(col("id") === lit(m)).show BTW, where are all these methods like lit etc. documented? Also I guess any action call like apply(0) or getInt(0) refers to the "current" parameter? Regards On 26 February 2016 at 09:42, Michał Zieliński

Re: Spark SQL support for sub-queries

2016-02-26 Thread Michał Zieliński
You need to collect the value. val m: Int = d.agg(max($"id")).collect.apply(0).getInt(0) d.filter(col("id") === lit(m)) On 26 February 2016 at 09:41, Mich Talebzadeh wrote: > Can this be done using DFs? > > > > scala> val HiveContext = new

Re: select * from mytable where column1 in (select max(column1) from mytable)

2016-02-26 Thread ayan guha
But can't I just use HiveContext and use hive's functionality, which does support subqueries? On Fri, Feb 26, 2016 at 4:28 PM, Mohammad Tariq wrote: > Spark doesn't support subqueries in WHERE clause, IIRC. It supports > subqueries only in the FROM clause as of now. See this

Fwd: Spark SQL support for sub-queries

2016-02-26 Thread Mich Talebzadeh
Can this be done using DFs? scala> val HiveContext = new org.apache.spark.sql.hive.HiveContext(sc) scala> val d = HiveContext.table("test.dummy") d: org.apache.spark.sql.DataFrame = [id: int, clustered: int, scattered: int, randomised: int, random_string: string, small_vc: string, padding:

Re: DirectFileOutputCommiter

2016-02-26 Thread Alexander Pivovarov
Amazon uses the following impl: https://gist.github.com/apivovarov/bb215f08318318570567 But for some reason Spark shows an error at the end of the job: 16/02/26 08:16:54 INFO scheduler.DAGScheduler: ResultStage 0 (saveAsTextFile at :28) finished in 14.305 s 16/02/26 08:16:54 INFO cluster.YarnScheduler:

Spark SQL support for sub-queries

2016-02-26 Thread Mich Talebzadeh
Can this be done using DFs? scala> val HiveContext = new org.apache.spark.sql.hive.HiveContext(sc) scala> val d = HiveContext.table("test.dummy") d: org.apache.spark.sql.DataFrame = [id: int, clustered: int, scattered: int, randomised: int, random_string: string, small_vc: string, padding:

Re: When I merge some datas,can't go on...

2016-02-26 Thread Jeff Zhang
rdd.map(e=>e.split("\\s")).map(e=>(e(0),e(1))).groupByKey() On Fri, Feb 26, 2016 at 3:20 PM, Bonsen wrote: > I have a file,like 1.txt: > 1 2 > 1 3 > 1 4 > 1 5 > 1 6 > 1 7 > 2 4 > 2 5 > 2 7 > 2 9 > > I want to merge them,results like this >
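A self-contained sketch of that suggestion (assuming whitespace-separated pairs in 1.txt as in the question):

    val merged = sc.textFile("1.txt")
      .map(_.split("\\s+"))
      .map(a => (a(0), a(1)))
      .groupByKey()          // merge all values sharing the same key
      .mapValues(_.toList.sorted)

    merged.collect.foreach { case (k, vs) => println(s"$k -> ${vs.mkString(",")}") }
    // e.g. 1 -> 2,3,4,5,6,7 and 2 -> 4,5,7,9 (output order may vary)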