Re: Performance tuning for local mode on one host

2016-07-25 Thread on
There are 4 cores on my system. Running spark with setMaster("local[2]") results in: PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 7 root 20 0 4748836

Re: read parquetfile in spark-sql error

2016-07-25 Thread cj
thank you. I see this sql in the spark doc: http://spark.apache.org/docs/1.6.1/sql-programming-guide.html -- Original -- From: "Takeshi Yamamuro";; Date: 2016-07-26 (Tue) 6:15 To: "cj"<124411...@qq.com>;

Re: spark2.0 how to use sparksession and StreamingContext same time

2016-07-25 Thread kevin
thanks a lot Terry 2016-07-26 12:03 GMT+08:00 Terry Hoo : > Kevin, > > Try to create the StreamingContext as following: > > val ssc = new StreamingContext(spark.sparkContext, Seconds(2)) > > > > On Tue, Jul 26, 2016 at 11:25 AM, kevin wrote: > >>

Re: read parquetfile in spark-sql error

2016-07-25 Thread cj
thank you. But I hope to read the parquet file as a table in spark-sql, not in a Java (or Scala) program. -- Original -- From: "Kabeer Ahmed";; Date: Mon, Jul 25, 2016 10:36 PM To: "cj"<124411...@qq.com>; Cc:

Re: spark2.0 how to use sparksession and StreamingContext same time

2016-07-25 Thread Terry Hoo
Kevin, Try to create the StreamingContext as following: val ssc = new StreamingContext(spark.sparkContext, Seconds(2)) On Tue, Jul 26, 2016 at 11:25 AM, kevin wrote: > hi,all: > I want to read data from kafka and regist as a table then join a jdbc > table. > My
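
A minimal, self-contained sketch of the pattern described above, assuming Spark 2.0 (the app name and master are placeholders):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object SessionPlusStreaming {
      def main(args: Array[String]): Unit = {
        // Build the SparkSession first.
        val spark = SparkSession.builder
          .appName("SessionPlusStreaming")
          .master("local[2]")
          .getOrCreate()

        // Reuse the session's SparkContext instead of creating a second context.
        val ssc = new StreamingContext(spark.sparkContext, Seconds(2))

        // ... define DStreams here ...

        ssc.start()
        ssc.awaitTermination()
      }
    }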

Re: ORC v/s Parquet for Spark 2.0

2016-07-25 Thread janardhan shetty
Thanks Timur for the explanation. What about if the data is log data which is delimited (csv or tsv), doesn't have too many nestings, and is in flat file formats? On Mon, Jul 25, 2016 at 7:38 PM, Timur Shenkao wrote: > 1) The opinions on StackOverflow are correct, not biased.

spark2.0 how to use sparksession and StreamingContext same time

2016-07-25 Thread kevin
hi, all: I want to read data from kafka and register it as a table, then join a jdbc table. My sample is like this: val spark = SparkSession .builder .config(sparkConf) .getOrCreate() val jdbcDF = spark.read.format("jdbc").options(Map("url" -> "jdbc:mysql://master1:3306/demo",
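
A rough sketch of the join-per-batch part of what is being asked here (untested; the Kafka source setup is omitted, and kafkaStream, the column names and the join condition are placeholders):

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.{DataFrame, SparkSession}
    import org.apache.spark.streaming.dstream.DStream

    // Assumes `spark` (SparkSession) and `jdbcDF` (the JDBC DataFrame above) exist,
    // and that `kafkaStream` is a DStream[(String, String)] built elsewhere.
    def joinEachBatch(spark: SparkSession, jdbcDF: DataFrame,
                      kafkaStream: DStream[(String, String)]): Unit = {
      import spark.implicits._
      jdbcDF.createOrReplaceTempView("jdbc_table")

      kafkaStream.foreachRDD { (rdd: RDD[(String, String)]) =>
        // Turn the micro-batch into a DataFrame and register it as a table.
        rdd.toDF("key", "value").createOrReplaceTempView("kafka_batch")

        // Join the batch with the JDBC table (placeholder column names).
        spark.sql(
          "SELECT b.key, b.value, j.* FROM kafka_batch b JOIN jdbc_table j ON b.key = j.id"
        ).show()
      }
    }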

Re: Maintaining order of pair rdd

2016-07-25 Thread janardhan shetty
groupBy is a shuffle operation and the index is already lost in this process, if I am not wrong, and I don't see a *sortWith* operation on RDD. Any suggestions or help? On Mon, Jul 25, 2016 at 12:58 AM, Marco Mistroni wrote: > Hi > after you do a groupBy you should use a sortWith.
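
For reference, RDDs do have sortBy (and sortByKey for pair RDDs), which may be the closest equivalent; a minimal sketch in spark-shell (where sc is predefined), assuming the original position is carried in the value as an index:

    // Hypothetical layout: (key, (index, value)), where index is the original position.
    val pairs = sc.parallelize(Seq(("a", (2, "x")), ("a", (1, "y")), ("b", (1, "z"))))

    // groupByKey does not preserve order across the shuffle, so re-sort each
    // group by the carried index after grouping.
    val grouped = pairs.groupByKey().mapValues(_.toList.sortBy { case (idx, _) => idx })

    grouped.collect().foreach(println)
    // e.g. (a,List((1,y), (2,x))) and (b,List((1,z)))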

UDF returning generic Seq

2016-07-25 Thread Chris Beavers
Hey there, I am interested in writing a UDF that returns an ArrayType column of unknown subtype. My understanding is that, JVM-type-wise, this translates to a Seq of a generic templated type: Seq[Any]. I seem to be hitting the constraint at
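
Not a direct answer to the Seq[Any] case, but for comparison, a UDF whose Seq has a concrete element type maps cleanly to ArrayType; a minimal sketch assuming Spark 2.0 in spark-shell:

    import org.apache.spark.sql.functions.udf
    import spark.implicits._

    // Seq[String] maps to ArrayType(StringType); the element type is known statically.
    val splitWords = udf { (s: String) =>
      if (s == null) Seq.empty[String] else s.split("\\s+").toSeq
    }

    val df = Seq("hello spark world", "one two").toDF("text")
    df.select(splitWords($"text").as("words")).show(false)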

Re: spark context stop vs close

2016-07-25 Thread Mail.com
Okay. Yes it is JavaSparkContext. Thanks. > On Jul 24, 2016, at 1:31 PM, Sean Owen wrote: > > I think this is about JavaSparkContext which implements the standard > Closeable interface for convenience. Both do exactly the same thing. > >> On Sun, Jul 24, 2016 at 6:27 PM,

ORC v/s Parquet for Spark 2.0

2016-07-25 Thread janardhan shetty
Just wondering about the advantages and disadvantages of converting data into ORC or Parquet. In the documentation of Spark there are numerous examples of the Parquet format. Any strong reasons to choose Parquet over the ORC file format? Also: current data compression is bzip2

Re: Odp.: spark2.0 can't run SqlNetworkWordCount

2016-07-25 Thread kevin
thanks a lot. After changing to Scala 2.11, it works. 2016-07-25 17:40 GMT+08:00 Tomasz Gawęda : > Hi, > > Please change the Scala version to 2.11. As far as I know, Spark packages are > now built with Scala 2.11 and I've got the other - 2.10 - version > > > >

Spark Web UI port 4040 not working

2016-07-25 Thread Jestin Ma
Hello, when running spark jobs, I can access the master UI (port 8080 one) no problem. However, I'm confused as to how to access the web UI to see jobs/tasks/stages/etc. I can access the master UI at http://:8080. But port 4040 gives me a -connection cannot be reached-. Is the web UI http://

Re: where I can find spark-streaming-kafka for spark2.0

2016-07-25 Thread kevin
Thank you. I can't find a spark-streaming-kafka_2.10 jar for spark2 in Maven Central, so I tried version 1.6.2; it does not work because it needs the class org.apache.spark.Logging, which can't be found in spark2. So I built a spark-streaming-kafka_2.10 jar for spark2 from the source code. It works now. 2016-07-26 2:12

Re: Potential Change in Kafka's Partition Assignment Semantics when Subscription Changes

2016-07-25 Thread Cody Koeninger
This seems really low risk to me. In order to be impacted, it'd have to be someone who was using the kafka integration in spark 2.0, which isn't even officially released yet. On Mon, Jul 25, 2016 at 7:23 PM, Vahid S Hashemian wrote: > Sorry, meant to ask if any Apache

Re: Spark SQL overwrite/append for partitioned tables

2016-07-25 Thread Yash Sharma
Correction - dataDF.write.partitionBy(“year”, “month”, “date”).mode(SaveMode.Append).text(“s3://data/test2/events/”) On Tue, Jul 26, 2016 at 10:59 AM, Yash Sharma wrote: > Based on the behavior of spark [1], Overwrite mode will delete all your > data when you try to overwrite

Re: Spark SQL overwrite/append for partitioned tables

2016-07-25 Thread Yash Sharma
Based on the behavior of spark [1], Overwrite mode will delete all your data when you try to overwrite a particular partition. What I did: - Use the S3 API to delete all partitions - Use the spark df to write in Append mode [2] 1.
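
A rough sketch of that delete-then-append pattern, for the archives (untested; paths and partition columns are placeholders, and the partition directory is removed here through the Hadoop FileSystem API rather than the S3 API directly):

    import org.apache.hadoop.fs.Path
    import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}

    def replacePartition(spark: SparkSession, dataDF: DataFrame,
                         basePath: String, year: Int, month: Int, date: Int): Unit = {
      // 1. Remove the existing partition directory (recursive delete).
      val partitionPath = new Path(s"$basePath/year=$year/month=$month/date=$date")
      val fs = partitionPath.getFileSystem(spark.sparkContext.hadoopConfiguration)
      if (fs.exists(partitionPath)) fs.delete(partitionPath, true)

      // 2. Append the fresh data; only the rewritten partition gains new files.
      dataDF.write
        .partitionBy("year", "month", "date")
        .mode(SaveMode.Append)
        .text(basePath)
    }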

Re: Potential Change in Kafka's Partition Assignment Semantics when Subscription Changes

2016-07-25 Thread Vahid S Hashemian
Sorry, meant to ask if any Apache Spark user would be affected. --Vahid From: Vahid S Hashemian/Silicon Valley/IBM@IBMUS To: user@spark.apache.org, d...@spark.apache.org Date: 07/25/2016 05:21 PM Subject:Potential Change in Kafka's Partition Assignment Semantics when

Potential Change in Kafka's Partition Assignment Semantics when Subscription Changes

2016-07-25 Thread Vahid S Hashemian
Hello, We have started a KIP under the Kafka project that proposes a fix for an inconsistency in how partition assignments are currently handled in Kafka when the consumer changes subscription. Note that this applies to new consumer only. The KIP can be found here:

Num of executors and cores

2016-07-25 Thread Mail.com
Hi All, I have a directory which has 12 files. I want to read each file in its entirety, so I am reading it as wholeTextFiles(dirpath, numPartitions). I run spark-submit with --num-executors 12 --executor-cores 1 and numPartitions 12. However, when I run the job I see that the stage which reads the
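
For reference, the second argument to wholeTextFiles is a minPartitions hint and the actual partition count can differ, so one hedged workaround is to repartition right after the read; a small sketch in spark-shell (the path is a placeholder):

    // minPartitions is only a hint; force 12 partitions explicitly if the read
    // stage comes up with fewer tasks than expected.
    val files = sc.wholeTextFiles("/path/to/dir", 12)
    val spread = files.repartition(12)
    println(spread.getNumPartitions)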

Re: Spark SQL overwrite/append for partitioned tables

2016-07-25 Thread Pedro Rodriguez
Probably should have been more specific with the code we are using, which is something like val df = df.write.mode("append or overwrite here").partitionBy("date").saveAsTable("my_table") Unless there is something like what I described on the native API, I will probably take the approach of

DAGScheduler: Job 20 finished: collectAsMap at DecisionTree.scala:651, took 19.556700 s Killed

2016-07-25 Thread Ascot Moss
Hi, spark: 1.6.1, java: 1.8_u40. I tried the random forest training phase; the same code works well with 20 trees (lower accuracy, about 68%). When trying the training phase with more trees, set to 200, it returned: "DAGScheduler: Job 20 finished: collectAsMap at

Re: Spark SQL overwrite/append for partitioned tables

2016-07-25 Thread RK Aduri
You can have a temporary file to capture the data that you would like to overwrite, and swap that with the existing partition whose data you want to wipe away. Swapping can be done by a simple rename of the partition, and then just repair the table to pick up the new partition. I am not sure if

Spark SQL overwrite/append for partitioned tables

2016-07-25 Thread Pedro Rodriguez
What would be the best way to accomplish the following behavior: 1. There is a table which is partitioned by date 2. Spark job runs on a particular date, we would like it to wipe out all data for that date. This is to make the job idempotent and lets us rerun a job if it failed without fear of

Re: Executors assigned to STS and number of workers in Stand Alone Mode

2016-07-25 Thread ayan guha
STS works on YARN, as a yarn-client application. One issue: STS is not HA-supported, though there was some discussion to make it HA similar to Hive Server. So what we did is to run STS on multiple nodes and tie them to a load balancer. On Tue, Jul 26, 2016 at 8:06 AM, Mich Talebzadeh

How to partition a SparkDataFrame using all distinct column values in sparkR

2016-07-25 Thread Neil Chang
Hi, This is a question regarding SparkR in spark 2.0. Given that I have a SparkDataFrame, I want to partition it using one column's values. Each value corresponds to a partition; all rows having the same column value shall go to the same partition, no more, no less. Seems the

Re: read parquetfile in spark-sql error

2016-07-25 Thread Takeshi Yamamuro
Hi, It seems your query was not consistent with the HQL syntax. You'd be better off re-checking the definitions: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-CreateTable // maropu On Mon, Jul 25, 2016 at 11:36 PM, Kabeer Ahmed wrote: >

Re: Executors assigned to STS and number of workers in Stand Alone Mode

2016-07-25 Thread Mich Talebzadeh
Correction. STS uses the same UI to display details about all processes running against it which is helpful but gets crowded :) Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

SPARK UDF related issue

2016-07-25 Thread Carlo . Allocca
Hi All, I am using SPARK 2.0 and I have got the following issue: I am able to run steps 1-5 (see below) but not step 6, which uses a UDF. Actually, steps 1-5 take a few seconds while step 6 looks like it never ends. Is there anything wrong? How should I address it? Any

Re: Bzip2 to Parquet format

2016-07-25 Thread Takeshi Yamamuro
Hi, This is the expected behaviour. The default compression for parquet is `snappy`. See: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L215 // maropu On Tue, Jul 26, 2016 at 6:33 AM, janardhan shetty
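
If a different codec is preferred, it can be switched before writing via the spark.sql.parquet.compression.codec setting; a small sketch assuming Spark 2.0 in spark-shell, with inputDF being the DataFrame from janardhan's message in this thread:

    // Valid values include "uncompressed", "snappy" (the default), "gzip" and "lzo".
    spark.conf.set("spark.sql.parquet.compression.codec", "gzip")
    inputDF.write.format("parquet").save("result.parquet")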

Re: Bzip2 to Parquet format

2016-07-25 Thread janardhan shetty
Andrew, on 2.0 I tried: val inputR = sc.textFile(file) val inputS = inputR.map(x => x.split("`")) val inputDF = inputS.toDF() inputDF.write.format("parquet").save(result.parquet) Result part files end with *.snappy.parquet*. Is that expected? On Sun, Jul 24, 2016 at 8:00 PM, Andrew Ehrlich

Re: Pls assist: Creating Spark EC2 cluster using spark_ec2.py script and a custom AMI

2016-07-25 Thread Mayank Ahuja
Hi Marco, From the AMI name shared, it seems to be an HVM image. The 'm1' instance family does not support HVM (only PV is supported). Either you can use a PV equivalent of this image or you can use the 'm3' family (the easiest transition from m1 to m3, if possible). Details:

Re: Executors assigned to STS and number of workers in Stand Alone Mode

2016-07-25 Thread Mich Talebzadeh
We also should remember that STS is a pretty useful tool. With JDBC you can use beeline, Zeppelin, Squirrel and other tools against it. One thing I like to change is the UI port that the thrift server listens on, and you can change it at startup using spark.ui.port. This is fixed at thrift startup

Re: Executors assigned to STS and number of workers in Stand Alone Mode

2016-07-25 Thread Jacek Laskowski
On Mon, Jul 25, 2016 at 10:57 PM, Mich Talebzadeh wrote: > Yarn promises the best resource management I believe. Having said that I have > not used Mesos myself. I'm glad you've mentioned it. I think Cloudera (and Hortonworks?) guys are doing a great job with

Re: Spark 2.0

2016-07-25 Thread Jacek Laskowski
Hi Bryan, Excellent questions about the upcoming 2.0! Took me a while to find the answer about structured streaming. Seen http://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/structured-streaming-programming-guide.html#window-operations-on-event-time ? That may be relevant

Re: Executors assigned to STS and number of workers in Stand Alone Mode

2016-07-25 Thread Mich Talebzadeh
Hi, Actually I started STS in local mode and that works. I have not tested yarn modes for STS, but certainly it appears that one can run these in any mode one wishes. Local mode has its limitations (all in one JVM and not taking advantage of scaling out), but one can run STS in local mode on the

Re: Executors assigned to STS and number of workers in Stand Alone Mode

2016-07-25 Thread Jacek Laskowski
Hi, That's interesting...What holds STS back from working on the other scheduler backends, e.g. YARN or Mesos? I haven't spent much time with it, but thought it's a mere Spark application. The property is spark.deploy.spreadOut = Whether the standalone cluster manager should spread applications

Check out Kyper! Trying to be Uber of Data

2016-07-25 Thread Daniel Lopes
I just signed up for Kyper and thought you might be interested, too! http://l.aunch.us/L7Ezb

Re: Spark 1.6.2 version displayed as 1.6.1

2016-07-25 Thread Krishna Sankar
This intrigued me as well. - Just to be sure, I downloaded the 1.6.2 code and recompiled. - spark-shell and pyspark both show 1.6.2 as expected. Cheers On Mon, Jul 25, 2016 at 1:45 AM, Daniel Darabos < daniel.dara...@lynxanalytics.com> wrote: > Another possible explanation is that by

Re: Spark 2.0

2016-07-25 Thread Pedro Rodriguez
Spark 2.0 vote for RC5 passed last Friday night so it will probably be released early this week if I had to guess. On Mon, Jul 25, 2016 at 12:23 PM, Bryan Jeffrey wrote: > All, > > I had three questions: > > (1) Is there a timeline for stable Spark 2.0 release? I know

Re: Executors assigned to STS and number of workers in Stand Alone Mode

2016-07-25 Thread Mich Talebzadeh
Thanks. As I understand STS only works in Standalone mode :( Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw *

Re: How to connect HBase and Spark using Python?

2016-07-25 Thread Def_Os
Solved, see: http://stackoverflow.com/questions/38470114/how-to-connect-hbase-and-spark-using-python/38575095 -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-connect-HBase-and-Spark-using-Python-tp27372p27409.html Sent from the Apache Spark User List

Re: Executors assigned to STS and number of workers in Stand Alone Mode

2016-07-25 Thread Jacek Laskowski
Hi, My vague understanding of Spark Standalone is that it will take up all available workers for a Spark application (despite the cmd options). There was a property to disable it. Can't remember it now though. Ps. Yet another reason for YARN ;-) Jacek On 25 Jul 2016 6:17 p.m., "Mich

Re: Performance tuning for local mode on one host

2016-07-25 Thread Mich Talebzadeh
Hi On, When you run in local mode there is only one SparkSubmit with one executor only. How many cores do you have? Each core will allow the same code to run concurrently, so with local[8] you will have 8 tasks running the same code on subsets of your data. So do cat /proc/cpuinfo|grep

Spark 2.0

2016-07-25 Thread Bryan Jeffrey
All, I had three questions: (1) Is there a timeline for stable Spark 2.0 release? I know the 'preview' build is out there, but was curious what the timeline was for full release. Jira seems to indicate that there should be a release 7/27. (2) For 'continuous' datasets there has been a lot of

Re: Performance tuning for local mode on one host

2016-07-25 Thread on
OK, sorry, I am running in local mode. Just a very small setup... (changed the subject) On 25.07.2016 20:01, Mich Talebzadeh wrote: > Hi, > > From your reference I can see that you are running in local mode with > two cores. But that is not standalone. > > Can you please clarify whether you

Re: where I can find spark-streaming-kafka for spark2.0

2016-07-25 Thread Cody Koeninger
For 2.0, the kafka dstream support is in two separate subprojects depending on which version of Kafka you are using: spark-streaming-kafka-0-10 or spark-streaming-kafka-0-8, corresponding to brokers that are version 0.10+ or 0.8+. On Mon, Jul 25, 2016 at 12:29 PM, Reynold Xin
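
In build terms that means depending on one of the two new artifacts rather than the old spark-streaming-kafka; a sketch for sbt, assuming Scala 2.11 and the 2.0.0 release:

    // build.sbt -- pick the artifact that matches your broker version.
    libraryDependencies += "org.apache.spark" %% "spark-streaming-kafka-0-10" % "2.0.0"
    // or, for 0.8.x brokers:
    // libraryDependencies += "org.apache.spark" %% "spark-streaming-kafka-0-8" % "2.0.0"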

Re: JavaRDD.foreach (new VoidFunction<>...) always returns the last element

2016-07-25 Thread Jia Zou
Hi Sean, Thanks for your great help! It works all right if I remove persist!! For next step, I will transform those values before persist. I convert to RDD and back to JavaRDD just for testing purposes. Best Regards, Jia On Mon, Jul 25, 2016 at 1:01 PM, Sean Owen wrote: >

Re: JavaRDD.foreach (new VoidFunction<>...) always returns the last element

2016-07-25 Thread Sean Owen
Why are you converting to RDD and back to JavaRDD? The problem is storing references to Writable, which are mutated by the InputFormat. Somewhere you have 1000 refs to the same key. I think it may be the persist. You want to immediately transform these values to something besides a Writable. On
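
A sketch of that fix in Scala terms (the Java version is analogous): copy the reused Writable instances into plain values immediately after reading and before any persist. DoubleArrayWritable is assumed here to be the user's own class extending Hadoop's ArrayWritable over DoubleWritable.

    import org.apache.hadoop.io.{DoubleWritable, IntWritable}

    // `sc` and `inputFile` as in the original post.
    val data = sc.sequenceFile(inputFile, classOf[IntWritable], classOf[DoubleArrayWritable])
      // Copy out of the mutable, reused Writables right away...
      .map { case (k, v) => (k.get, v.get.map(_.asInstanceOf[DoubleWritable].get)) }
      // ...and only then persist the now-immutable values.
      .persist()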

Re: Performance tuning for standalone on one host

2016-07-25 Thread Mich Talebzadeh
Hi, From your reference I can see that you are running in local mode with two cores. But that is not standalone. Can you please clarify whether you start the master and slave processes? Those are for standalone mode. sbin/start-master.sh sbin/start-slaves.sh HTH Dr Mich Talebzadeh LinkedIn *

JavaRDD.foreach (new VoidFunction<>...) always returns the last element

2016-07-25 Thread Jia Zou
My code is as follows: System.out.println("Initialize points..."); JavaPairRDD<IntWritable, DoubleArrayWritable> data = sc.sequenceFile(inputFile, IntWritable.class, DoubleArrayWritable.class);

Re: where I can find spark-streaming-kafka for spark2.0

2016-07-25 Thread Andy Davidson
Hi Kevin, just a heads up: at the recent Spark Summit in S.F. there was a presentation about streaming in 2.0. They said that streaming was not going to be production ready in 2.0. I am not sure if the older 1.6.x version will be supported. My project will not be able to upgrade with streaming

Re: where I can find spark-streaming-kafka for spark2.0

2016-07-25 Thread Reynold Xin
The presentation at Spark Summit SF was probably referring to Structured Streaming. The existing Spark Streaming (dstream) in Spark 2.0 has the same production stability level as Spark 1.6. There is also Kafka 0.10 support in dstream. On July 25, 2016 at 10:26:49 AM, Andy Davidson (

Performance tuning for standalone on one host

2016-07-25 Thread on
Dear all, I am running spark on one host ("local[2]") doing calculations like this on a socket stream. mainStream = socketStream.filter(lambda msg: msg['header'].startswith('test')).map(lambda x: (x['host'], x) ) s1 = mainStream.updateStateByKey(updateFirst).map(lambda x: (1, x) ) s2 =

Using DirectOutputCommitter with ORC

2016-07-25 Thread Daniel Haviv
Hi, How can the DirectOutputCommitter be utilized for writing ORC files? I tried setting it via: sc.getConf.set("spark.hadoop.mapred.output.committer.class","com.veracity-group.datastage.directorcwriter") But I can still see a _temporary directory being used when I save my dataframe as ORC.

Executors assigned to STS and number of workers in Stand Alone Mode

2016-07-25 Thread Mich Talebzadeh
Hi, I am doing some tests. I have started Spark in standalone mode. For simplicity I am using one node only with 8 workers, and I have 12 cores. In spark-env.sh I set this: # Options for the daemons used in the standalone deploy mode export SPARK_WORKER_CORES=1 ##, total number of cores to be used

Re: jdbcRDD and dataframe

2016-07-25 Thread Marco Colombo
Thanks, I would submit an improvement. On Monday, 25 July 2016, Mich Talebzadeh wrote: > I don't think there is. > > it would be a viable request to use a connection pool through DF to connect > to an RDBMS > > cheers > > Dr Mich Talebzadeh > > > > LinkedIn * >

get hdfs file path in spark

2016-07-25 Thread Yang Cao
Hi, being new here, I hope to get assistance from you guys. I wonder whether there is some elegant way to get some directory under some path. For example, I have a path on hdfs like /a/b/c/d/e/f, and I am given a/b/c; is there any straightforward way to get the path /a/b/c/d/e? I think I can do
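
One hedged way to do this is the plain Hadoop FileSystem API rather than anything Spark-specific; a sketch that lists every directory underneath a given prefix, from which the wanted /a/b/c/d/e can be picked:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    // Recursively collect all directories under `prefix`, e.g. listDirs("/a/b/c")
    // would include /a/b/c/d and /a/b/c/d/e if they exist.
    def listDirs(prefix: String): Seq[Path] = {
      val conf = new Configuration() // picks up the cluster settings on the classpath
      val fs = FileSystem.get(new Path(prefix).toUri, conf)
      def walk(p: Path): Seq[Path] = {
        val dirs = fs.listStatus(p).filter(_.isDirectory).map(_.getPath).toSeq
        dirs ++ dirs.flatMap(walk)
      }
      walk(new Path(prefix))
    }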

Re: jdbcRDD and dataframe

2016-07-25 Thread Mich Talebzadeh
I don't think there is. It would be a viable request to use a connection pool through DF to connect to an RDBMS. Cheers Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

Re: jdbcRDD and dataframe

2016-07-25 Thread Marco Colombo
From getConnection I'm handling a connection pool. I see no option for that in the docs. Regards On Monday, 25 July 2016, Mich Talebzadeh wrote: > Hi Marco, > > what is in your UDF getConnection and why not use DF itself? > > I guess it is all connection attributes

Re: jdbcRDD and dataframe

2016-07-25 Thread Mich Talebzadeh
Hi Marco, what is in your UDF getConnection and why not use DF itself? I guess it is all connection attributes val c = HiveContext.load("jdbc", Map("url" -> _ORACLEserver, "dbtable" -> "(SELECT to_char(CHANNEL_ID) AS CHANNEL_ID, CHANNEL_DESC FROM sh.channels)", "user" -> _username, "password"

Pls assist: Creating Spark EC2 cluster using spark_ec2.py script and a custom AMI

2016-07-25 Thread Marco Mistroni
Hi all, I was wondering if anyone can help with this. I have created a Spark cluster before using the spark_ec2.py script from Spark 1.6.1, which by default uses a very old AMI... so I decided to try to launch the script with a more up-to-date AMI. The one I have used is ami-d732f0b7, which refers to

Re: read parquetfile in spark-sql error

2016-07-25 Thread Kabeer Ahmed
I hope the below sample helps you: val parquetDF = hiveContext.read.parquet("hdfs://.parquet") parquetDF.registerTempTable("parquetTable") sql("SELECT * FROM parquetTable").collect().foreach(println) Kabeer. Sent from Nylas

jdbcRDD and dataframe

2016-07-25 Thread Marco Colombo
Hi all, I was using JdbcRDD, and its constructor signature accepted a function to get a DB connection. This is very useful to provide my own connection handler. I'm evaluating a move to dataframes, but I cannot see how to provide such a function and migrate my code. I want to use my own

unsubscribe

2016-07-25 Thread milad bourhani

Re: add spark-csv jar to ipython notebook without packages flags

2016-07-25 Thread Ndjido Ardo BAR
Hi Pseudo, try this: export SPARK_SUBMIT_OPTIONS = "--jars spark-csv_2.10-1.4.0.jar, commons-csv-1.1.jar" This has been working for me for a long time ;-) both in Zeppelin (for Spark Scala) and IPython Notebook (for PySpark). cheers, Ardo On Mon, Jul 25, 2016 at 1:28 PM, pseudo oduesp

Re: unsubscribe)

2016-07-25 Thread Daniel Lopes
Hi Uzi, To unsubscribe e-mail: user-unsubscr...@spark.apache.org *Daniel Lopes* Chief Data and Analytics Officer | OneMatch c: +55 (18) 99764-2733 | https://www.linkedin.com/in/dslopes www.onematch.com.br On Mon, Jul 25,

SPARK SQL and join pipeline issue

2016-07-25 Thread Carlo . Allocca
Dear All, I have the following question: I am using the SPARK SQL 2.0 version and, in particular, I am doing some joins in a pipeline of the following pattern (d3 = d1 join d2, d4 = d5 join d6, d7 = d3 join d4). When running my code, I realised that building d7 generates an issue, as reported

Re: add spark-csv jar to ipython notebook without packages flags

2016-07-25 Thread pseudo oduesp
I tried PYSPARK_SUBMIT_ARGS = --jars spark-csv_2.10-1.4.0.jar,commons-csv-1.1.jar without success. Thanks. 2016-07-25 13:27 GMT+02:00 pseudo oduesp : > Hi, > can someone tell me how I can add jars to ipython? I tried spark > > >

add spark-csv jar to ipython notebook without packages flags

2016-07-25 Thread pseudo oduesp
Hi, can someone tell me how I can add jars to ipython? I tried spark

Re: Spark, Scala, and DNA sequencing

2016-07-25 Thread James McCabe
Thanks Ofir and Sean, I'm aware of AMPLab's ADAM. Spark is bringing down the cost of genome analysis to consumer level. The potential for the future of medicine is exciting indeed. I normally do Scala consulting which keeps me too busy, but recently I finally got some spare time to look

read parquetfile in spark-sql error

2016-07-25 Thread cj
hi, all: I use spark 1.6.1 as my work env. When I saved the following content as a test1.sql file: CREATE TEMPORARY TABLE parquetTable USING org.apache.spark.sql.parquet OPTIONS ( path "examples/src/main/resources/people.parquet" ) SELECT * FROM parquetTable and use

Fwd: PySpark : Filter based on resultant query without additional dataframe

2016-07-25 Thread kiran kumar
Hi All, I am trying out spark for the first time, so am reaching out with what would seem a very basic question. Consider the below example: >>> l = [("US","City1",125),("US","City2",123),("Europe","CityX",23),("Europe","CityY",17)] >>> print l [('US', 'City1', 125), ('US', 'City2',

Re: Spark ml.ALS question -- RegressionEvaluator .evaluate giving ~1.5 output for same train and predict data

2016-07-25 Thread Rohit Chaddha
Hi Krishna, Great... I had no idea about this. I tried your suggestion by using na.drop() and got an rmse = 1.5794048211812495. Any suggestions on how this can be reduced and the model improved? Regards, Rohit On Mon, Jul 25, 2016 at 4:12 AM, Krishna Sankar wrote: > Thanks

spark2.0 can't run SqlNetworkWordCount

2016-07-25 Thread kevin
hi, all: I downloaded the spark2.0 pre-built package. I can run the SqlNetworkWordCount test using: bin/run-example org.apache.spark.examples.streaming.SqlNetworkWordCount master1 but when I use the spark2.0 example source code SqlNetworkWordCount.scala and build it into a jar package with dependencies (JDK 1.8 AND

Re: where I can find spark-streaming-kafka for spark2.0

2016-07-25 Thread Marco Mistroni
Hi Kevin, you should not need to rebuild everything. Instead, I believe you should launch spark-submit by specifying the kafka jar file in your --packages... I had to do the same when integrating Spark Streaming with Flume. Have you checked this link?

Re: where I can find spark-streaming-kafka for spark2.0

2016-07-25 Thread kevin
I have compiled it from source code. 2016-07-25 12:05 GMT+08:00 kevin : > hi, all: > I tried to run the example org.apache.spark.examples.streaming.KafkaWordCount , > and I got an error: > Exception in thread "main" java.lang.NoClassDefFoundError: >

Re: Spark 1.6.2 version displayed as 1.6.1

2016-07-25 Thread Daniel Darabos
Another possible explanation is that by accident you are still running Spark 1.6.1. Which download are you using? This is what I see: $ ~/spark-1.6.2-bin-hadoop2.6/bin/spark-shell log4j:WARN No appenders could be found for logger (org.apache.hadoop.metrics2.lib.MutableMetricsFactory). log4j:WARN

Re: Maintaining order of pair rdd

2016-07-25 Thread Marco Mistroni
Hi, after you do a groupBy you should use a sortWith. Basically, a groupBy reduces your structure to (anyone correct me if I'm wrong) an RDD[(key, val)], which you can see as a tuple, so you could use sortWith (or sortBy, cannot remember which one) (tpl => tpl._1). HTH On Mon, Jul 25, 2016 at

Re: Hive and distributed sql engine

2016-07-25 Thread Marco Colombo
Thanks. That's what I was thinking. But how do I set up a connection per worker? On Monday, 25 July 2016, ayan guha wrote: > In order to use an existing pg UDF, you may create a view in pg and expose > the view to hive. > The Spark-to-database connection happens from each

Re: Hive and distributed sql engine

2016-07-25 Thread ayan guha
In order to use an existing pg UDF, you may create a view in pg and expose the view to hive. The Spark-to-database connection happens from each executor, so you must have a connection or a pool of connections per worker. Executors of the same worker can share a connection pool. Best, Ayan On 25 Jul 2016
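
A common pattern for the per-executor pool is a lazily initialized singleton object referenced from inside foreachPartition; a minimal sketch (rdd stands for whatever RDD or dataset.rdd is being written, the JDBC URL and credentials are placeholders, and the "pool" is reduced to a single cached connection to keep it short):

    import java.sql.{Connection, DriverManager}

    // One copy of this object lives in each executor JVM, so the connection is
    // created at most once per executor and shared by the tasks it runs.
    object ConnectionProvider {
      private var conn: Connection = _
      def get(): Connection = synchronized {
        if (conn == null || conn.isClosed) {
          conn = DriverManager.getConnection("jdbc:postgresql://host:5432/db", "user", "pass")
        }
        conn
      }
    }

    // Used from an RDD partition: the connection is obtained on the executor,
    // never serialized from the driver.
    rdd.foreachPartition { rows =>
      val c = ConnectionProvider.get()
      rows.foreach { r =>
        // ... run statements for r using c ...
      }
    }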

Re: java.lang.RuntimeException: Unsupported type: vector

2016-07-25 Thread Hyukjin Kwon
I just wonder what your CSV data structure looks like. If my understanding is correct, the SQL type of the VectorUDT is StructType, and the CSV data source does not support ArrayType or StructType. Anyhow, it seems CSV does not support UDT for now anyway.

Hive and distributed sql engine

2016-07-25 Thread Marco Colombo
Hi all! Among other use cases, I want to use Spark as a distributed SQL engine via the thrift server. I have some tables in Postgres and Cassandra; I need to expose them via hive for custom reporting. The basic implementation is simple and works, but I have some concerns and open questions: - is there a