RE: How to add MaxDOP option in spark mssql JDBC

2024-04-24 Thread Appel, Kevin
You might be able to leverage the prepareQuery option, that is at https://spark.apache.org/docs/3.5.1/sql-data-sources-jdbc.html#data-source-option ... this was introduced in Spark 3.4.0 to handle temp table query and CTE query against MSSQL server since what you send in is not actually what
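A minimal sketch of the prepareQuery idea (assuming Spark 3.4.0+ and a SQL Server JDBC driver on the classpath; the connection details and table names below are hypothetical): the prefix set in prepareQuery is prepended to the query option, so a CTE can be defined ahead of the statement Spark wraps.

    val df = spark.read
      .format("jdbc")
      .option("url", "jdbc:sqlserver://host:1433;databaseName=demo")
      .option("prepareQuery", "WITH cte AS (SELECT id, qty FROM dbo.orders WHERE qty > 10)")
      .option("query", "SELECT id, qty FROM cte")
      .option("user", "...")
      .option("password", "...")
      .load()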

Unsubscribe

2023-07-27 Thread Kevin Wang
Unsubscribe please!

RE: determine week of month from date in spark3

2022-02-11 Thread Appel, Kevin
-03-30| 2| | 4|2014-03-31| 3| | 5|2015-03-07| 7| | 6|2015-03-08| 1| | 7|2015-03-30| 2| | 8|2015-03-31| 3| +---+--++ From: Appel, Kevin Sent: Friday, February 11, 2022 2:35 PM To: user@spark.apache.org; 'Sean Owen

determine week of month from date in spark3

2022-02-11 Thread Appel, Kevin
aveats or items to be aware of that might get us later? For example in a future Spark 3.3.X is this option going to be deprecated This was an item that we ran into from Spark2 to Spark3 conversion and trying to see how to best handle this Thanks for your feedback, Kevin
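One workaround discussed for this Spark 2 to 3 change is re-enabling the legacy parser so the 'W' week-of-month pattern keeps working. A sketch, assuming a DataFrame df with a date column named date_col; whether the legacy flag stays supported in later 3.x releases is exactly the question raised above.

    import org.apache.spark.sql.functions.{col, date_format}

    spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")
    val withWeek = df.withColumn("week_of_month", date_format(col("date_col"), "W"))

An alternative that avoids the flag entirely is deriving week-of-month from dayofmonth/dayofweek arithmetic, at the cost of picking your own week definition.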

How to run spark benchmark on standalone cluster?

2021-07-02 Thread Kevin Su
Hi all, I want to run spark benchmark on a standalone cluster, and I have changed the DataSourceReadBenchmark.scala setting. (Remove "spark.master") --- a/sql/core/src/test /scala/org/apache/spark/sql/execution/benchmark/DataSourceReadBenchmark.scala +++ b/sql/core/src/test

Fwd: Fail to run benchmark in Github Action

2021-06-26 Thread Kevin Su
-- Forwarded message - From: Kevin Su Date: Fri, Jun 25, 2021 at 8:23 PM Subject: Fail to run benchmark in Github Action To: Hi all, I try to run a benchmark test in GitHub Actions in my fork, and I hit the error below. https://github.com/pingsutw/spark/runs/2867617238


Re: Stream-static join : Refreshing subset of static data / Connection pooling

2020-11-29 Thread chen kevin
big table that will be joined. * I think frequent I/O actions like select may cause memory or I/O issues. 2. You can use PostgreSQL connection pools to avoid making connections frequently. -- Best, Kevin Chen From: Geervan Hayatnagarkar Date: Sunday, November 29, 2020 at 6:20 PM

Re: Stream-static join : Refreshing subset of static data / Connection pooling

2020-11-29 Thread chen kevin
Hi, you can use Debezium to capture the row-level changes in PostgreSQL in real time, stream them to Kafka, and finally ETL and write the data to HBase with Flink/Spark Streaming. So you can join against the data in HBase directly. In consideration of the particularly big table, the scan performance

Re: how to manage HBase connections in Executors of Spark Streaming ?

2020-11-25 Thread chen kevin
1. the issue of Kerberos tickets expiring. * You usually don't need to worry about it; you can use the local keytab on every node in the Hadoop cluster. * If the keytab is not available in your Hadoop cluster, you will need to update your keytab in every executor periodically. 2.

Re: Using two WriteStreams in same spark structured streaming job

2020-11-08 Thread Kevin Pis
h function, then I may need to use custom Kafka stream > writer > right ?! > > And I might not be able to use default writestream.format(Kafka) method ?! > > > > -- > Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/ > > - > To unsubscribe e-mail: user-unsubscr...@spark.apache.org > > -- Best, Kevin Pis
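A sketch of the foreachBatch approach touched on in this thread (Spark 2.4+), which lets one streaming query feed two sinks without a custom Kafka writer; inputDF, the payload column, paths, broker address and topic are all hypothetical.

    import org.apache.spark.sql.DataFrame

    def writeTwoSinks(batch: DataFrame, batchId: Long): Unit = {
      batch.persist()
      batch.write.mode("append").parquet("/data/sink1")        // sink 1: parquet
      batch.selectExpr("CAST(payload AS STRING) AS value")     // sink 2: Kafka needs a value column
        .write.format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092")
        .option("topic", "sink2")
        .save()
      batch.unpersist()
    }

    val query = inputDF.writeStream.foreachBatch(writeTwoSinks _).start()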

Re: Spark streaming with Kafka

2020-11-03 Thread Kevin Pis
t; > > -- > Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/ > > - > To unsubscribe e-mail: user-unsubscr...@spark.apache.org > > -- Best, Kevin Pis

Re: spark-submit parameters about two keytab files to yarn and kafka

2020-11-01 Thread kevin chen
g to SASL_PLAINTEXT, if your spark version is 1.6. *note:* my test env: spark 2.0.2 kafka 0.10 references 1. using-spark-streaming <https://docs.cloudera.com/HDPDocuments/HDP2/HDP-2.6.0/bk_spark-component-guide/content/using-spark-streaming.html> -- Best, Kevin Pis Gabor Somogyi 于2

Re: [Spark Core] Vectorizing very high-dimensional data sourced in long format

2020-11-01 Thread kevin chen
Perhaps adding random numbers (a salt) to the entity_id column can avoid errors such as exhausting executor and driver memory when you solve the issue Patrick's way. Daniel Chalef wrote on Sat, Oct 31, 2020 at 12:42 AM: > Yes, the resulting matrix would be sparse. Thanks for the suggestion. Will > explore ways of
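A sketch of the salting idea suggested here (the DataFrame df and its entity_id/value columns are placeholders): append a random suffix so a hot key's rows spread over many partitions, aggregate partially, then aggregate again on the original key.

    import org.apache.spark.sql.functions._

    val salted  = df.withColumn("salted_id",
      concat(col("entity_id"), lit("_"), (rand() * 10).cast("int")))
    val partial = salted.groupBy("salted_id", "entity_id")
      .agg(sum("value").as("partial_sum"))
    val result  = partial.groupBy("entity_id")
      .agg(sum("partial_sum").as("total"))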

Re: Exception when reading multiline JSON file

2019-09-12 Thread Kevin Mellott
this easier to troubleshoot because you can execute the Spark code one step at a time using their visual notebook experience. Hope that helps point you in the right direction. https://spark.apache.org/docs/latest/monitoring.html https://m.youtube.com/watch?v=KscZf1y97m8 Kevin On Thu, Sep 12, 2019 at 12

Re: How to sleep Spark job

2019-01-22 Thread Kevin Mellott
I’d recommend using a scheduler of some kind to trigger your job each hour, and have the Spark job exit when it completes. Spark is not meant to run in any type of “sleep mode”, unless you want to run a structured streaming job and create a separate process to pull data from Casandra and publish

spark jdbc postgres query results don't match those of postgres query

2018-03-29 Thread Kevin Peng
I am running into a weird issue in Spark 1.6, which I was wondering if anyone has encountered before. I am running a simple select query from spark using a jdbc connection to postgres: val POSTGRES_DRIVER: String = "org.postgresql.Driver" val srcSql = """select total_action_value, last_updated

NullPointerException issue in LDA.train()

2018-02-09 Thread Kevin Lam
heavily followed the code outlined here: http://sean.lane.sh/blog/2016/PySpark_and_LDA Any ideas or help is appreciated!! Thanks in advance, Kevin Example trace of output: 16:22:55 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 8.0 in >> stage 42.0 (TID 16163, >> royal

How to preserve the order of parquet files?

2018-02-07 Thread Kevin Jung
Hi all, In spark 2.2.1, when I load parquet files, it shows differently ordered result of original dataset. It seems like FileSourceScanExec.createNonBucketedReadRDD method sorts parquet file splits by their own lengths. - val splitFiles = selectedPartitions.flatMap { partition =>

[Spark ML] LogisticRegressionWithSGD

2017-06-29 Thread Kevin Quinn
Hello, I'd like to build a system that leverages semi-online updates and I wanted to use stochastic gradient descent. However, after looking at the documentation it looks like that method is deprecated. Is there a reason why it was deprecated? Is there a planned replacement? As far as I know

Re: Any NLP library for sentiment analysis in Spark?

2017-04-11 Thread Kevin Wang
I am also interested in this topic. Anything else anyone can recommend? Thanks. Best, Kevin On Tue, Apr 11, 2017 at 5:00 AM, Alonso Isidoro Roman <alons...@gmail.com> wrote: > i did not use it yet, but this library looks promising: > > https://github.com/databricks/spark-coren

Re: Aggregated column name

2017-03-23 Thread Kevin Mellott
.agg(count("number"))*.alias("ColumnNameCount")* Hope that helps! Kevin On Thu, Mar 23, 2017 at 2:41 AM, Wen Pei Yu <yuw...@cn.ibm.com> wrote: > Hi All > > I found some spark version(spark 1.4) return upper case aggregated > column, and some return low cas
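A sketch of that suggestion, naming the aggregate explicitly so the resulting column name no longer depends on the Spark version's default casing (df and its columns are placeholders):

    import org.apache.spark.sql.functions.count

    val agged = df.groupBy("group_col")
      .agg(count("number").alias("NumberCount"))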

Re: Setting Optimal Number of Spark Executor Instances

2017-03-15 Thread Kevin Peng
Mohini, We set that parameter before we went and played with the number of executors and that didn't seem to help at all. Thanks, KP On Tue, Mar 14, 2017 at 3:37 PM, mohini kalamkar wrote: > Hi, > > try using this parameter --conf spark.sql.shuffle.partitions=1000

Re: pivot over non numerical data

2017-02-01 Thread Kevin Mellott
This should work for non-numerical data as well - can you please elaborate on the error you are getting and provide a code sample? As a preliminary hint, you can "aggregate" text values using *max*. df.groupBy("someCol") .pivot("anotherCol") .agg(max($"textC
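A sketch of the hint above, pivoting text values by aggregating with max (df, someCol, anotherCol and textCol are placeholders):

    import org.apache.spark.sql.functions.{col, max}

    val pivoted = df.groupBy("someCol")
      .pivot("anotherCol")
      .agg(max(col("textCol")))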

Re: spark-shell running out of memory even with 6GB ?

2017-01-09 Thread Kevin Burton
f jobs, the logged SparkEvents of which stick around in order > for the UI to render. There are some options under `spark.ui.retained*` to > limit that if it's a problem. > > > On Mon, Jan 9, 2017 at 6:00 PM, Kevin Burton <bur...@spinn3r.com> wrote: > >> We've had

spark-shell running out of memory even with 6GB ?

2017-01-09 Thread Kevin Burton
it more of course but would be nice to know if this is a legitimate memory constraint or there is a bug somewhere. PS: One thought I had was that it would be nice to have spark keep track of where an OOM was encountered, in what component. Kevin -- We’re hiring if you know of any awesome Java

Re: Spark app write too many small parquet files

2016-12-08 Thread Kevin Tran
. What is the best practice for file size and number of files? How do I compact small parquet files into a smaller number of bigger parquet files? Thanks, Kevin. On Tue, Nov 29, 2016 at 3:01 AM, Chin Wei Low <lowchin...@gmail.com> wrote: > Try limit the partitions. spark.sql.shuffle.p

OutOfMemoryError while running job...

2016-12-06 Thread Kevin Burton
I am trying to run a Spark job which reads from ElasticSearch and should write it's output back to a separate ElasticSearch index. Unfortunately I keep getting `java.lang.OutOfMemoryError: Java heap space` exceptions. I've tried running it with: --conf spark.memory.offHeap.enabled=true --conf

Re: Spark app write too many small parquet files

2016-11-28 Thread Kevin Tran
Hi Denny, Thank you for your inputs. I also use 128 MB but there are still too many files generated by the Spark app, each only ~14 KB! That's why I'm asking if there is a solution for this, if someone has the same issue. Cheers, Kevin. On Mon, Nov 28, 2016 at 7:08 PM, Denny Lee <denny.g@gmail.

Spark app write too many small parquet files

2016-11-27 Thread Kevin Tran
Should it write each chunk at a bigger data size (such as 128 MB) with a proper number of files? Has anyone found any performance changes when changing the data size of each parquet file? Thanks, Kevin.
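A common remedy for this (a sketch, not taken from the thread's resolution): reduce the number of output files by repartitioning or coalescing before the write; the partition count and path are placeholders to tune against the ~128 MB target.

    df.repartition(8)            // or coalesce(8) to avoid a full shuffle
      .write
      .mode("append")
      .parquet("/path/to/output")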

Re: Nearest neighbour search

2016-11-14 Thread Kevin Mellott
You may be able to benefit from Soundcloud's open source implementation, either as a solution or as a reference implementation. https://github.com/soundcloud/cosine-lsh-join-spark Thanks, Kevin On Sun, Nov 13, 2016 at 2:07 PM, Meeraj Kunnumpurath < mee...@servicesymphony.com>

Re: Spark Streaming Advice

2016-10-10 Thread Kevin Mellott
will have to check out HBase as well; I've heard good things! Thanks, Kevin On Mon, Oct 10, 2016 at 11:38 AM, Mich Talebzadeh <mich.talebza...@gmail.com > wrote: > Hi Kevin, > > What is the streaming interval (batch interval) above? > > I do analytics on streaming trade data b

Re: Spark Streaming Advice

2016-10-10 Thread Kevin Mellott
seconds to around 1 second. // ssc = instance of SparkStreamingContext ssc.sparkContext.hadoopConfiguration.set("parquet.enable.summary-metadata", "false") I've also verified that the parquet files being generated are usable by both Hive and Impala. Hope that helps! Kevin

Re: Spark ML Decision Trees Algorithm

2016-09-30 Thread Kevin Mellott
The documentation details the algorithm being used at http://spark.apache.org/docs/latest/mllib-decision-tree.html Thanks, Kevin On Fri, Sep 30, 2016 at 1:14 AM, janardhan shetty <janardhan...@gmail.com> wrote: > Hi, > > Any help here is appreciated .. > > On Wed, Sep

Re: Dataframe Grouping - Sorting - Mapping

2016-09-30 Thread Kevin Mellott
api/scala/index.html#org.apache.spark.sql.DataFrame Thanks, Kevin On Fri, Sep 30, 2016 at 5:46 AM, AJT <at...@currenex.com> wrote: > I'm looking to do the following with my Spark dataframe > (1) val df1 = df.groupBy() > (2) val df2 = df1.sort() > (3) val df3 = df2.mapPartitions(

Extract timestamp from Kafka message

2016-09-25 Thread Kevin Tran
.class, kafkaParams, topics ); Thanks, Kevin.

Re: Optimal/Expected way to run demo spark-scala scripts?

2016-09-23 Thread Kevin Mellott
://databricks.com/try-databricks Thanks, Kevin On Fri, Sep 23, 2016 at 2:37 PM, Dan Bikle <bikle...@gmail.com> wrote: > hello spark-world, > > I am new to spark and want to learn how to use it. > > I come from the Python world. > > I see an example at the url below: >

Re: In Spark-scala, how to fill Vectors.dense in DataFrame from CSV?

2016-09-22 Thread Kevin Mellott
You'll want to use the spark-csv package, which is included in Spark 2.0. The repository documentation has some great usage examples. https://github.com/databricks/spark-csv Thanks, Kevin On Thu, Sep 22, 2016 at 8:40 PM, Dan Bikle <bikle...@gmail.com> wrote: > hello spark-world, >

Re: unresolved dependency: datastax#spark-cassandra-connector;2.0.0-s_2.11-M3-20-g75719df: not found

2016-09-21 Thread Kevin Mellott
3 You can verify the available versions by searching Maven at http://search.maven.org. Thanks, Kevin On Wed, Sep 21, 2016 at 3:38 AM, muhammet pakyürek <mpa...@hotmail.com> wrote: > while i run the spark-shell as below > > spark-shell --jars '/home/ktuser/spark-cassandra- > connector

Re: Similar Items

2016-09-20 Thread Kevin Mellott
Using the Soundcloud implementation of LSH, I was able to process a 22K product dataset in a mere 65 seconds! Thanks so much for the help! On Tue, Sep 20, 2016 at 1:15 PM, Kevin Mellott <kevin.r.mell...@gmail.com> wrote: > Thanks Nick - those examples will help a ton!! > > On Tu

Re: write.df is failing on Spark Cluster

2016-09-20 Thread Kevin Mellott
one also > > On Sep 20, 2016 10:44 PM, "Kevin Mellott" <kevin.r.mell...@gmail.com> > wrote: > >> Instead of *mode="append"*, try *mode="overwrite"* >> >> On Tue, Sep 20, 2016 at 11:30 AM, Sankar Mittapally < >> sankar.m

Re: Similar Items

2016-09-20 Thread Kevin Mellott
looked at. > https://github.com/soundcloud/cosine-lsh-join-spark - not used this but > looks like it should do exactly what you need. > https://github.com/mrsqueeze/*spark*-hash > <https://github.com/mrsqueeze/spark-hash> > > > On Tue, 20 Sep 2016 at 18:06 Kevin Mellott <

Re: write.df is failing on Spark Cluster

2016-09-20 Thread Kevin Mellott
uot;) > > I tried these two commands. > write.df(sankar2,"/nfspartition/sankar/test/test.csv","csv",header="true") > > saveDF(sankar2,"sankartest.csv",source="csv",mode="append",schema="true") > > > > On Tue,

Re: write.df is failing on Spark Cluster

2016-09-20 Thread Kevin Mellott
Can you please post the line of code that is doing the df.write command? On Tue, Sep 20, 2016 at 9:29 AM, Sankar Mittapally < sankar.mittapa...@creditvidya.com> wrote: > Hey Kevin, > > It is a empty directory, It is able to write part files to the directory > but while mergin

Re: write.df is failing on Spark Cluster

2016-09-20 Thread Kevin Mellott
/api/R/write.df.html Thanks, Kevin On Tue, Sep 20, 2016 at 12:16 AM, sankarmittapally < sankar.mittapa...@creditvidya.com> wrote: > We have setup a spark cluster which is on NFS shared storage, there is no > permission issues with NFS storage, all the users are able to write to NFS

Similar Items

2016-09-19 Thread Kevin Mellott
do this in Spark? I tried creating a UDF (that used the Breeze linear algebra methods internally); however, that did not scale well. Thanks, Kevin

Re: study materials for operators on Dataframe

2016-09-19 Thread Kevin Mellott
I would recommend signing up for a Databricks Community Edition account. It will give you access to a 6GB cluster, with many different example programs that you can use to get started. https://databricks.com/try-databricks If you are looking for a more formal training method, I just completed

Re: driver OOM - need recommended memory for driver

2016-09-19 Thread Kevin Mellott
html Hope that helps! Kevin On Mon, Sep 19, 2016 at 9:32 AM, Anand Viswanathan < anand_v...@ymail.com.invalid> wrote: > Hi, > > Spark version :spark-1.5.2-bin-hadoop2.6 ,using pyspark. > > I am running a machine learning program, which runs perfectly by > specifyin

Re: take() works on RDD but .write.json() does not work in 2.0.0

2016-09-19 Thread Kevin Burton
I tried with write.json and write.csv. The write.text method won't work because I have more than one column and refuses to execute. Doesn't seem to work on any data. On Sat, Sep 17, 2016 at 10:52 PM, Hyukjin Kwon <gurwls...@gmail.com> wrote: > Hi Kevin, > > I hav

Re: Missing output partition file in S3

2016-09-19 Thread Chen, Kevin
, Kevin From: Steve Loughran <ste...@hortonworks.com<mailto:ste...@hortonworks.com>> Date: Friday, September 16, 2016 at 3:46 AM To: Chen Kevin <kevin.c...@neustar.biz<mailto:kevin.c...@neustar.biz>> Cc: "user@spark.apache.org<mailto:user@spark.apache.org>&q

take() works on RDD but .write.json() does not work in 2.0.0

2016-09-17 Thread Kevin Burton
I'm seeing some weird behavior and wanted some feedback. I have a fairly large, multi-hour job that operates over about 5TB of data. It builds it out into a ranked category index of about 25000 categories sorted by rank, descending. I want to write this to a file but it's not actually writing

Missing output partition file in S3

2016-09-15 Thread Chen, Kevin
Hi, Has any one encountered an issue of missing output partition file in S3 ? My spark job writes output to a S3 location. Occasionally, I noticed one partition file is missing. As a result, one chunk of data was lost. If I rerun the same job, the problem usually goes away. This has been

Add sqldriver.jar to Spark 1.6.0 executors

2016-09-14 Thread Kevin Tran
me ! Does anyone have a Spark app working with a driver jar on the executors? If so, please give me your ideas. Thank you. Cheers, Kevin.

Re: Spark 2.0.0 won't let you create a new SparkContext?

2016-09-13 Thread Kevin Burton
rally use --conf to set this on the command line if using > the shell. > > > On Tue, Sep 13, 2016, 19:22 Kevin Burton <bur...@spinn3r.com> wrote: > >> The problem is that without a new spark context, with a custom conf, >> elasticsearch-hadoop is refusing to read in se

Re: Spark 2.0.0 won't let you create a new SparkContext?

2016-09-13 Thread Kevin Burton
may arise > from relying on this email's technical content is explicitly disclaimed. > The author will in no case be liable for any monetary damages arising from > such loss, damage or destruction. > > > > On 13 September 2016 at 18:57, Sean Owen <so...@cloudera.com> wrote: > &g

Spark 2.0.0 won't let you create a new SparkContext?

2016-09-13 Thread Kevin Burton
I'm rather confused here as to what to do about creating a new SparkContext. Spark 2.0 prevents it... (exception included below) yet a TON of examples I've seen basically tell you to create a new SparkContext as standard practice:

"Too many elements to create a power set" on Elasticsearch

2016-09-11 Thread Kevin Burton
1.6.1 and 1.6.2 don't work on our Elasticsearch setup because we use daily indexes. We get the error: "Too many elements to create a power set" It works on SINGLE indexes.. but if I specify content_* then I get this error. I don't see this documented anywhere. Is this a known issue? Is there

Re: Selecting the top 100 records per group by?

2016-09-10 Thread Kevin Burton
, Sep 10, 2016 at 7:42 PM, Kevin Burton <bur...@spinn3r.com> wrote: > Ah.. might actually. I'll have to mess around with that. > > On Sat, Sep 10, 2016 at 6:06 PM, Karl Higley <kmhig...@gmail.com> wrote: > >> Would `topByKey` help? >> >> https://github.c

Re: Selecting the top 100 records per group by?

2016-09-10 Thread Kevin Burton
scala#L42 > > Best, > Karl > > On Sat, Sep 10, 2016 at 9:04 PM Kevin Burton <bur...@spinn3r.com> wrote: > >> I'm trying to figure out a way to group by and return the top 100 records >> in that group. >> >> Something like: >> >> SELECT TOP(

Selecting the top 100 records per group by?

2016-09-10 Thread Kevin Burton
I'm trying to figure out a way to group by and return the top 100 records in that group. Something like: SELECT TOP(100, user_id) FROM posts GROUP BY user_id; But I can't really figure out the best way to do this... There is a FIRST and LAST aggregate function but this only returns one column.
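The usual answer (a sketch): rank rows inside each group with a window function and keep the first 100; the ordering column score is a placeholder.

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.{col, row_number}

    val w = Window.partitionBy("user_id").orderBy(col("score").desc)
    val top100 = posts.withColumn("rn", row_number().over(w))
      .filter(col("rn") <= 100)
      .drop("rn")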

Re: call() function being called 3 times

2016-09-07 Thread Kevin Tran
] INFO org.apache.spark.executor.Executor - Finished task 0.0 in stage 12.0 (TID 12). 2518 bytes result sent to driver Does anyone have any ideas? On Wed, Sep 7, 2016 at 7:30 PM, Kevin Tran <kevin...@gmail.com> wrote: > Hi Everyone, > Does anyone know why call() function bei

call() function being called 3 times

2016-09-07 Thread Kevin Tran
1* > > SQLContext sqlContext = new SQLContext(rdd.context()); > > >> JavaRDD rowRDD = rdd.map(new Function<String, JavaBean>() { > > public JavaBean call(String record) { >> *<== being called 3 times* > > What I tried: * *cache()* * cleaning up *checkpoint dir* Thanks, Kevin.

Re: Best ID Generator for ID field in parquet ?

2016-09-04 Thread Kevin Tran
Hi Mich, Thank you for your input. Does the monotonically increasing ID guard against race conditions, or does it duplicate IDs at some point with multiple threads, multiple instances, ...? Even System.currentTimeMillis() can still produce duplicates? Cheers, Kevin. On Mon, Sep 5, 2016 at 12:30 AM, Mich

Best ID Generator for ID field in parquet ?

2016-09-04 Thread Kevin Tran
Hi everyone, Please give me your opinions on what is the best ID Generator for ID field in parquet ? UUID.randomUUID(); AtomicReference currentTime = new AtomicReference<>(System.currentTimeMillis()); AtomicLong counter = new AtomicLong(0); Thanks, Kevin. https://issues.apac
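A sketch of two common choices in recent Spark versions (hedged; neither is necessarily what the thread settled on): a random UUID string per row, or Spark's own unique-but-not-consecutive 64-bit ID.

    import org.apache.spark.sql.functions.{expr, monotonically_increasing_id}

    val withUuid = df.withColumn("id", expr("uuid()"))                  // Spark 2.3+ SQL function
    val withLong = df.withColumn("id", monotonically_increasing_id())   // unique per row, not sequential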

Re: Best practises to storing data in Parquet files

2016-08-28 Thread Kevin Tran
reference architecture which HBase is a part of? Please share with me the best practices you might know or your favourite designs. Thanks, Kevin. On Mon, Aug 29, 2016 at 5:18 AM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote: > Hi, > > Can you explain about you p

Best practises to storing data in Parquet files

2016-08-28 Thread Kevin Tran
Hi, Does anyone know the best practices for storing data in parquet files? Does a parquet file have a size limit (1 TB)? Should we use SaveMode.APPEND for a long-running streaming app? How should we store the data in HDFS (directory structure, ...)? Thanks, Kevin.

Spark StringType could hold how many characters ?

2016-08-28 Thread Kevin Tran
could handle ? In the Spark code: org.apache.spark.sql.types.StringType /** * The default size of a value of the StringType is 4096 bytes. */ override def defaultSize: Int = 4096 Thanks, Kevin.

Write parquet file from Spark Streaming

2016-08-27 Thread Kevin Tran
Hi Everyone, Does anyone know how to write parquet file after parsing data in Spark Streaming? Thanks, Kevin.
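A sketch for the DStream API (names such as parsedStream and the Record fields are hypothetical): convert each micro-batch to a DataFrame and append it as parquet.

    import org.apache.spark.sql.SparkSession

    case class Record(id: String, value: Double)

    parsedStream.foreachRDD { rdd =>                // parsedStream: DStream[(String, Double)]
      val spark = SparkSession.builder.config(rdd.sparkContext.getConf).getOrCreate()
      import spark.implicits._
      rdd.map { case (id, v) => Record(id, v) }.toDF()
        .write.mode("append").parquet("/data/stream_output")
    }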

Re: tpcds for spark2.0

2016-08-01 Thread kevin
:17 GMT+08:00 Olivier Girardot <o.girar...@lateral-thoughts.com >: > I have the same kind of issue (not using spark-sql-perf), just trying to > deploy 2.0.0 on mesos. > I'll keep you posted as I investigate > > > > On Wed, Jul 27, 2016 1:06 PM, kevin kiss.kevin...@gm

Re: spark.read.format("jdbc")

2016-08-01 Thread kevin
UES('user_id','email','gender')" > statement.executeUpdate(sql_insert) > > > Also you should specify path your jdbc jar file in --driver-class-path > variable when you running spark-submit: > > spark-shell --master "local[2]" --driver-class-path > /opt/cloudera/p

Re: spark.read.format("jdbc")

2016-08-01 Thread kevin
maybe there is another version spark on the classpath? 2016-08-01 14:30 GMT+08:00 kevin <kiss.kevin...@gmail.com>: > hi,all: >I try to load data from jdbc datasource,but I got error with : > java.lang.RuntimeException: Multiple sources

spark.read.format("jdbc")

2016-08-01 Thread kevin
hi,all: I try to load data from jdbc datasource,but I got error with : java.lang.RuntimeException: Multiple sources found for jdbc (org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider, org.apache.spark.sql.execution.datasources.jdbc.DefaultSource), please specify the fully
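The follow-up in this thread points at a conflicting Spark version on the classpath, which is the real fix; as a stopgap sketch, the error message's own suggestion of naming the source by its fully qualified class also works (connection details are placeholders):

    val jdbcDF = spark.read
      .format("org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider")
      .option("url", "jdbc:mysql://master1:3306/demo")
      .option("dbtable", "some_table")
      .option("user", "...")
      .option("password", "...")
      .load()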

tpcds for spark2.0

2016-07-27 Thread kevin
hi,all: I want to test the tpcds99 SQL queries on spark2.0. I use https://github.com/databricks/spark-sql-perf at the master version; when I run: val tpcds = new TPCDS (sqlContext = sqlContext) I got an error: scala> val tpcds = new TPCDS (sqlContext = sqlContext) error: missing or invalid

Re: dataframe.foreach VS dataframe.collect().foreach

2016-07-26 Thread kevin
collect*. > That mean until you call collect spark* do nothing* so you df would not > have any data -> can’t call foreach. > Call collect execute the process -> get data -> foreach is ok. > > > On Jul 26, 2016, at 2:30 PM, kevin <kiss.kevin...@gmail.com> wrote: > > blacklistDF.collect() > > >

dataframe.foreach VS dataframe.collect().foreach

2016-07-26 Thread kevin
HI ALL: I don't quite understand the difference between dataframe.foreach and dataframe.collect().foreach. When to use dataframe.foreach? I use spark2.0; I want to iterate over a dataframe to get one column's value. This works: blacklistDF.collect().foreach { x =>
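A sketch contrasting the two calls (the column name is a placeholder): foreach runs on the executors, so nothing comes back to the driver and println output lands in executor logs; collect().foreach first pulls every row to the driver, which is convenient for small results and dangerous for large ones.

    blacklistDF.foreach { row =>
      println(row.getAs[String]("domain"))   // runs on executors; output in executor logs
    }

    blacklistDF.collect().foreach { row =>
      println(row.getAs[String]("domain"))   // runs on the driver after collecting all rows
    }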

Re: spark2.0 how to use sparksession and StreamingContext same time

2016-07-25 Thread kevin
thanks a lot Terry 2016-07-26 12:03 GMT+08:00 Terry Hoo <hujie.ea...@gmail.com>: > Kevin, > > Try to create the StreamingContext as following: > > val ssc = new StreamingContext(spark.sparkContext, Seconds(2)) > > > > On Tue, Jul 26, 2016 at 11:25 AM, kevi

spark2.0 how to use sparksession and StreamingContext same time

2016-07-25 Thread kevin
hi,all: I want to read data from Kafka and register it as a table, then join it with a JDBC table. My sample is like this: val spark = SparkSession .builder .config(sparkConf) .getOrCreate() val jdbcDF = spark.read.format("jdbc").options(Map("url" -> "jdbc:mysql://master1:3306/demo",
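The reply in the thread above is the key: build the StreamingContext from the existing SparkSession's SparkContext instead of a second SparkConf. A sketch:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val spark = SparkSession.builder.config(sparkConf).getOrCreate()
    val ssc   = new StreamingContext(spark.sparkContext, Seconds(2))
    // inside foreachRDD, register each micro-batch as a temp view and join it with jdbcDF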

Re: Odp.: spark2.0 can't run SqlNetworkWordCount

2016-07-25 Thread kevin
r - 2.10 - version > > > > -- > *From:* kevin <kiss.kevin...@gmail.com> > *Sent:* 25 July 2016 11:33 > *To:* user.spark; dev.spark > *Subject:* spark2.0 can't run SqlNetworkWordCount > > hi,all: > I downloaded spark2.0 pre-built. I

Re: where I can find spark-streaming-kafka for spark2.0

2016-07-25 Thread kevin
oduction stability level as Spark 1.6. There is also Kafka 0.10 > support in > > dstream. > > > > On July 25, 2016 at 10:26:49 AM, Andy Davidson > > (a...@santacruzintegration.com) wrote: > > > > Hi Kevin > > > > Just a heads up at the recent

spark2.0 can't run SqlNetworkWordCount

2016-07-25 Thread kevin
hi,all: I downloaded the spark2.0 pre-built package. I can run the SqlNetworkWordCount test using: bin/run-example org.apache.spark.examples.streaming.SqlNetworkWordCount master1 but when I use the spark2.0 example source code SqlNetworkWordCount.scala and build it into a jar with dependencies ( JDK 1.8 AND

Re: where I can find spark-streaming-kafka for spark2.0

2016-07-25 Thread kevin
I have compile it from source code 2016-07-25 12:05 GMT+08:00 kevin <kiss.kevin...@gmail.com>: > hi,all : > I try to run example org.apache.spark.examples.streaming.KafkaWordCount , > I got error : > Exception in thread "main" java.lang.NoClassDefFoundError: > o

where I can find spark-streaming-kafka for spark2.0

2016-07-24 Thread kevin
hi,all : I try to run example org.apache.spark.examples.streaming.KafkaWordCount , I got error : Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/streaming/kafka/KafkaUtils$ at org.apache.spark.examples.streaming.KafkaWordCount$.main(KafkaWordCount.scala:57) at

ClassNotFoundException: org.apache.parquet.hadoop.ParquetOutputCommitter

2016-07-07 Thread kevin
hi,all: I build spark use: ./make-distribution.sh --name "hadoop2.7.1" --tgz "-Pyarn,hadoop-2.6,parquet-provided,hive,hive-thriftserver" -DskipTests -Dhadoop.version=2.7.1 I can run example : ./bin/spark-submit --class org.apache.spark.examples.SparkPi \ --master spark://master1:7077 \

Re: Classpath hell and Elasticsearch 2.3.2...

2016-06-02 Thread Kevin Burton
Yeah.. thanks Nick. Figured that out since your last email... I deleted the 2.10 by accident but then put 2+2 together. Got it working now. Still sticking to my story that it's somewhat complicated to setup :) Kevin On Thu, Jun 2, 2016 at 3:59 PM, Nick Pentreath <nick.pentre...@gmail.

Re: Classpath hell and Elasticsearch 2.3.2...

2016-06-02 Thread Kevin Burton
haps > not with recent Spark versions). > > > > On Thu, 2 Jun 2016 at 15:34 Kevin Burton <bur...@spinn3r.com> wrote: > >> I'm trying to get spark 1.6.1 to work with 2.3.2... needless to say it's >> not super easy. >> >> I wish there was an easier way to get this

Classpath hell and Elasticsearch 2.3.2...

2016-06-02 Thread Kevin Burton
d try again. Lots of trial and error here :-/ Kevin -- We’re hiring if you know of any awesome Java Devops or Linux Operations Engineers! Founder/CEO Spinn3r.com Location: *San Francisco, CA* blog: http://burtonator.wordpress.com … or check out my Google+ profile <https://plus.go

Compute the global rank of the column

2016-05-31 Thread Dai, Kevin
Hi, All I want to compute the rank of some column in a table. Currently, I use a window function to do it; however, all the data ends up in one partition. Is there a better solution? Regards, Kevin.
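One way to avoid the single-partition window (a sketch; it assigns positions after a distributed sort rather than following SQL rank tie semantics, and the column name is a placeholder):

    val ranked = df.select("score").rdd
      .map(_.getDouble(0))
      .sortBy(identity, ascending = false)   // distributed sort, no single partition
      .zipWithIndex()                        // (value, 0-based global position)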

Re: Alternative to groupByKey() + mapValues() for non-commutative, non-associative aggregate?

2016-05-03 Thread Kevin Mellott
If you put this into a dataframe then you may be able to use one hot encoding and treat these as categorical features. I believe that the ml pipeline components use project tungsten so the performance will be very fast. After you process the result on the dataframe you would then need to assemble
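A sketch of the one-hot encoding suggestion with the ml Pipeline API (column names are placeholders; note that in Spark 3.x OneHotEncoder itself became an estimator, so the exact class behavior differs by version):

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}

    val indexer = new StringIndexer().setInputCol("category").setOutputCol("categoryIndex")
    val encoder = new OneHotEncoder().setInputCol("categoryIndex").setOutputCol("categoryVec")
    val encoded = new Pipeline().setStages(Array(indexer, encoder)).fit(df).transform(df)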

Re: Weird results with Spark SQL Outer joins

2016-05-03 Thread Kevin Peng
em > you are > >> filtering all the rows with null columns in those fields. In other > words you > >> are doing a inner join in all your queries. > >> > >> On Tue, May 3, 2016 at 11:37 AM, Gourav Sengupta < > gourav.sengu...@gmail.com>

Re: Weird results with Spark SQL Outer joins

2016-05-03 Thread Kevin Peng
all the rows with null columns in those fields. In other words > you > > are doing a inner join in all your queries. > > > > On Tue, May 3, 2016 at 11:37 AM, Gourav Sengupta < > gourav.sengu...@gmail.com> > > wrote: > >> > >> Hi Kevin, > &

Re: Weird results with Spark SQL Outer joins

2016-05-03 Thread Kevin Peng
pe show the same > results, > which meant that all the rows from left could match at least one row from > right, > all the rows from right could match at least one row from left, even > the number of row from left does not equal that of right. > > This is correct result. > >

Re: Weird results with Spark SQL Outer joins

2016-05-02 Thread Kevin Peng
Yong, Sorry, let me explain my deduction; it is going to be difficult to get sample data out since the dataset I am using is proprietary. From the above set of queries (the ones mentioned in the comments above) both inner and outer join are producing the same counts. They are basically pulling out selected

Re: Weird results with Spark SQL Outer joins

2016-05-02 Thread Kevin Peng
Gourav, I wish that was case, but I have done a select count on each of the two tables individually and they return back different number of rows: dps.registerTempTable("dps_pin_promo_lt") swig.registerTempTable("swig_pin_promo_lt") dps.count() RESULT: 42632 swig.count() RESULT: 42034

Re: Weird results with Spark SQL Outer joins

2016-05-02 Thread Kevin Peng
Gourav, Apologies. I edited my post with this information: Spark version: 1.6 Result from spark shell OS: Linux version 2.6.32-431.20.3.el6.x86_64 ( mockbu...@c6b9.bsys.dev.centos.org) (gcc version 4.4.7 20120313 (Red Hat 4.4.7-4) (GCC) ) #1 SMP Thu Jun 19 21:14:45 UTC 2014 Thanks, KP On Mon,

Re: Run a self-contained Spark app on a Spark standalone cluster

2016-04-16 Thread Kevin Eid
One last email to announce that I've fixed all of the issues. Don't hesitate to contact me if you encounter the same. I'd be happy to help. Regards, Kevin On 14 Apr 2016 12:39 p.m., "Kevin Eid" <kevin.e...@mail.dcu.ie> wrote: > Hi all, > > I managed to copy my .py files

Re: Run a self-contained Spark app on a Spark standalone cluster

2016-04-12 Thread Kevin Eid
ions about how to move those files from local to the cluster? Thanks in advance, Kevin On 12 April 2016 at 12:19, Sun, Rui <rui@intel.com> wrote: > Which py file is your main file (primary py file)? Zip the other two py > files. Leave the main py file alone. Don't copy them

Introducing Spark User Group in Korea & Question on creating non-software goods (stickers)

2016-04-01 Thread Kevin (Sangwoo) Kim
Hi all! I'm Kevin, one of the contributors of Spark, and I'm organizing the Spark User Group in Korea. We have 2,500 members in the community, and it's growing even faster today. https://www.facebook.com/groups/sparkkoreauser/ -

Re: println not appearing in libraries when running job using spark-submit --master local

2016-03-28 Thread Kevin Peng
Ted, What triggerAndWait does is perform a REST call to a specified url and then waits until the status message that gets returned by that url in a json field says complete. The issue is I put a println at the very top of the method and that doesn't get printed out, and I know that println

java.lang.OutOfMemoryError: Direct buffer memory when using broadcast join

2016-03-21 Thread Dai, Kevin
(SingleThreadEventExecutor.java:111) at java.lang.Thread.run(Thread.java:745) Can anyone tell me what's wrong and how to fix it? Best Regards, Kevin.

Re: How to convert Parquet file to a text file.

2016-03-15 Thread Kevin Mellott
I'd recommend reading the parquet file into a DataFrame object, and then using spark-csv to write to a CSV file. On Tue, Mar 15, 2016 at 3:34 PM, Shishir Anshuman wrote: > I need to convert the parquet file generated by the
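A sketch of that suggestion for Spark 1.x with the spark-csv package (paths are placeholders):

    val df = sqlContext.read.parquet("/path/to/input_parquet")
    df.write
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .save("/path/to/output_csv")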

Reducing multiple values

2016-03-14 Thread Kevin Mc Ghee
Hi all, For each record I’m processing in a Spark streaming app (written in Java) I need to take over 30 datapoints. The output of my map would be something like: KEY1,1,0,1,0,30,1,1,1,1,0,30,… KEY1,0,1,1,0,15,1,1,1,1,0,28,… KEY2,0,1,1,0,22,1,1,1,1,0,0,… And I want to end up with:
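A sketch of one way to do the reduction (in Scala for brevity, and assuming the goal is an element-wise sum of the 30+ datapoints per key):

    // pairs: RDD[(String, Array[Int])], e.g. ("KEY1", Array(1, 0, 1, 0, 30, ...))
    val summed = pairs.reduceByKey { (a, b) =>
      a.zip(b).map { case (x, y) => x + y }
    }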
