Re: Starting with spark

2014-07-24 Thread Jerry
. Best Regards, Jerry Sent from my iPad On Jul 24, 2014, at 6:53 AM, Sameer Sayyed sam.sayyed...@gmail.com wrote: Hello All, I am a new user of Spark; I am using cloudera-quickstart-vm-5.0.0-0-vmware to execute the sample examples of Spark. I am very sorry for the silly and basic question. I am

What am I missing that's preventing javac from finding the libraries (CLASSPATH is setup...)?

2015-08-18 Thread Jerry
org.apache.spark.sql.hive.*; Let me know what I'm doing wrong. Thanks, Jerry

Are there any external dependencies for lag() and lead() when using data frames?

2015-08-10 Thread Jerry
the spark shell, so all I do is Test.run(sc) in shell. Let me know what to look for to debug this problem. I'm not sure where to look to solve this problem. Thanks, Jerry

Re: Are there any external dependencies for lag() and lead() when using data frames?

2015-08-10 Thread Jerry
By the way, if Hive is present in the Spark install, does it show up in the text when you start the spark shell? Any commands I can run to check if it exists? I didn't set up the spark machine that I use, so I don't know what's present or absent. Thanks, Jerry On Mon, Aug 10, 2015 at 2:38 PM
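
A minimal sketch of the kind of query being discussed, assuming Spark 1.4/1.5 where window functions require a HiveContext, and hypothetical column names (key, ts, value):

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.{col, lag, lead}

    // df is a DataFrame created from a HiveContext with columns key, ts, value
    val w = Window.partitionBy("key").orderBy("ts")
    val withNeighbours = df
      .withColumn("prev", lag(col("value"), 1).over(w))
      .withColumn("next", lead(col("value"), 1).over(w))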

Another issue with using lag and lead with data frames

2015-08-14 Thread Jerry
So it seems like dataframes aren't going to give me a break and just work. Now it evaluates, but goes nuts if it runs into a null case OR doesn't know how to get the correct data type when I specify the default value as a string expression. Let me know if anyone has a workaround for this. PLEASE HELP
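
Two possible workarounds, sketched under the assumption of the Spark 1.4/1.5 DataFrame API with made-up column names: pass a typed default as the third argument to lag/lead instead of a string expression, or wrap the call in coalesce so the nulls produced at the partition edges get an explicit value:

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.{coalesce, col, lag, lit}

    val w = Window.partitionBy("key").orderBy("ts")   // hypothetical partition/ordering columns
    // typed default value instead of a string expression
    val prev1 = lag(col("amount"), 1, 0.0).over(w)
    // or replace the null at the first row of each partition explicitly
    val prev2 = coalesce(lag(col("amount"), 1).over(w), lit(0.0))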

Re: Another issue with using lag and lead with data frames

2015-08-14 Thread Jerry
those links point me to something useful. Let me know if you can run the above code/ what you did different to get that code to run. Thanks, Jerry On Fri, Aug 14, 2015 at 1:23 PM, Salih Oztop soz...@yahoo.com wrote: Hi Jerry, This blog post is perfect for window functions in Spark. https

Re: Another issue with using lag and lead with data frames

2015-08-14 Thread Jerry
) at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:193) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:112) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) On Fri, Aug 14, 2015 at 1:39 PM, Jerry jerry.c...@gmail.com wrote: Hi Salih, Normally I do sort before

Re: Does the driver program always run local to where you submit the job from?

2015-08-26 Thread Jerry
Thanks! On Wed, Aug 26, 2015 at 2:06 PM, Marcelo Vanzin van...@cloudera.com wrote: On Wed, Aug 26, 2015 at 2:03 PM, Jerry jerry.c...@gmail.com wrote: Assuming you're submitting the job from a terminal; when main() is called, if I try to open a file locally, can I assume the machine is always

Fast way to parse JSON in Spark

2016-02-23 Thread Jerry
. The JSON messages are coming from a Kafka consumer, at over 1,500 messages per second, so the message processing (parsing and writing to Cassandra) also needs to be completed at the same rate (1,500/second). Thanks in advance. Jerry I appreciate it if you can give me any help and advice.

Optimize the performance of inserting data to Cassandra with Kafka and Spark Streaming

2016-02-16 Thread Jerry
time) But only about 100 messages can be inserted into Cassandra in each round of the test. Can anybody give me advice on why the other messages (about 900 messages) can't be consumed? How do I configure and tune the parameters in order to improve the throughput of the consumers? Thank you very much fo

Re: Optimize the performance of inserting data to Cassandra with Kafka and Spark Streaming

2016-02-17 Thread Jerry
Rado, Yes, you are correct. A lot of messages are created at almost the same time (even using milliseconds). I changed to use "UUID.randomUUID()", with which all messages can be inserted into the Cassandra table without colliding on the timestamp. Thank you very much! Jerry Wong On Wed, Feb 17, 2016
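
A rough sketch of the fix being described, assuming the spark-cassandra-connector and hypothetical keyspace, table, and column names:

    import com.datastax.spark.connector._   // spark-cassandra-connector, assumed on the classpath
    import java.util.UUID

    // messages: RDD[String] already parsed from the Kafka stream
    // key each row by a random UUID so rows created in the same millisecond
    // no longer overwrite each other on the primary key
    val rows = messages.map(m => (UUID.randomUUID().toString, m))
    rows.saveToCassandra("my_keyspace", "events", SomeColumns("id", "payload"))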

Re: Missing data in Kafka Consumer

2016-05-05 Thread Jerry
Hi David, Thank you for your response. Before inserting into Cassandra, I had checked that the data was already missing in HDFS (my second step is to load data from HDFS and then insert into Cassandra). Can you send me the link relating to this bug in 0.8.2? Thank you! Jerry On Thu, May 5, 2016 at 12:38

Missing data in Kafka Consumer

2016-05-05 Thread Jerry
and confirmed the same number in the Broker. But when I checked either HDFS or Cassandra, the number is just 363. The data is not always lost, just sometimes... That's weird and annoying to me. Can anybody give me some reasons? Thanks! Jerry

Re: Sample Project for using Shark API in Spark programs

2014-04-07 Thread Jerry Lam
Hi Shark, Should I assume that Shark users should not use the Shark APIs since there is no documentation for them? If there is documentation, can you point it out? Best Regards, Jerry On Thu, Apr 3, 2014 at 9:24 PM, Jerry Lam chiling...@gmail.com wrote: Hello everyone, I have

Re: hbase scan performance

2014-04-09 Thread Jerry Lam
it with spark, I don't think you can get a lot of performance from scanning HBase unless you are talking about caching the results from HBase in spark and reuse it over and over. HTH, Jerry On Wed, Apr 9, 2014 at 12:02 PM, David Quigley dquigle...@gmail.com wrote: Hi all, We are currently using hbase

Spark Summit 2014 (Hotel suggestions)

2014-05-06 Thread Jerry Lam
Hi Spark users, Do you guys plan to go to the Spark Summit? Can you recommend any hotel near the conference? I'm not familiar with the area. Thanks! Jerry

Re: Spark Summit 2014 (Hotel suggestions)

2014-05-27 Thread Jerry Lam
Hi guys, I ended up reserving a room at the Phoenix (Hotel: http://www.jdvhotels.com/hotels/california/san-francisco-hotels/phoenix-hotel) recommended by my friend who has been in SF. According to Google, it takes 11min to walk to the conference which is not too bad. Hope this helps! Jerry

Re: java.lang.OutOfMemoryError (java.lang.OutOfMemoryError: GC overhead limit exceeded)

2014-07-08 Thread Jerry Lam
the error you saw. By reducing the number of cores, there are more cpu resources available to a task so the GC could finish before the error gets thrown. HTH, Jerry On Tue, Jul 8, 2014 at 1:35 PM, Aaron Davidson ilike...@gmail.com wrote: There is a difference from actual GC overhead, which can

Re: Purpose of spark-submit?

2014-07-09 Thread Jerry Lam
+1 as well for being able to submit jobs programmatically without using a shell script. We also experience issues submitting jobs programmatically without using spark-submit. In fact, even in the Hadoop world, I rarely used hadoop jar to submit jobs from a shell. On Wed, Jul 9, 2014 at 9:47 AM,

Re: Purpose of spark-submit?

2014-07-09 Thread Jerry Lam
that defines how my application should look. In my humble opinion, using Spark as an embeddable library rather than the main framework and runtime is much easier. On Wed, Jul 9, 2014 at 5:14 PM, Jerry Lam chiling...@gmail.com wrote: +1 as well for being able to submit jobs programmatically without

Potential bugs in SparkSQL

2014-07-10 Thread Jerry Lam
or is it a bug in Spark SQL? Best Regards, Jerry

Using HQL is terribly slow: Potential Performance Issue

2014-07-10 Thread Jerry Lam
issue? For the curious mind, the dataset is about 200-300GB and we are using 10 machines for this benchmark. Given the env is the same between the two experiments, why is pure Spark faster than SparkSQL? Best Regards, Jerry

Re: Using HQL is terribly slow: Potential Performance Issue

2014-07-10 Thread Jerry Lam
By the way, I also tried hql(select * from m).count. It is terribly slow too. On Thu, Jul 10, 2014 at 5:08 PM, Jerry Lam chiling...@gmail.com wrote: Hi Spark users and developers, I'm doing some simple benchmarks with my team and we found out a potential performance issue using Hive via

Re: Using HQL is terribly slow: Potential Performance Issue

2014-07-10 Thread Jerry Lam
Hi Spark users, Also, to put the performance issue into perspective, we also ran the query on Hive. It took about 5 minutes to run. Best Regards, Jerry On Thu, Jul 10, 2014 at 5:10 PM, Jerry Lam chiling...@gmail.com wrote: By the way, I also try hql(select * from m).count. It is terribly

Re: Using HQL is terribly slow: Potential Performance Issue

2014-07-10 Thread Jerry Lam
overhead, then there must be something additional that SparkSQL adds to the overall overheads that Hive doesn't have. Best Regards, Jerry On Thu, Jul 10, 2014 at 7:11 PM, Michael Armbrust mich...@databricks.com wrote: On Thu, Jul 10, 2014 at 2:08 PM, Jerry Lam chiling...@gmail.com wrote

Re: Potential bugs in SparkSQL

2014-07-10 Thread Jerry Lam
[], (MetastoreRelation test, m, None), None HiveTableScan [id#106], (MetastoreRelation test, s, Some(s)), None Best Regards, Jerry On Thu, Jul 10, 2014 at 7:16 PM, Michael Armbrust mich...@databricks.com wrote: Hi Jerry, Thanks for reporting this. It would be helpful if you could

Re: Can we get a spark context inside a mapper

2014-07-14 Thread Jerry Lam
of spark, but maybe not. HTH, Jerry On Mon, Jul 14, 2014 at 3:09 PM, Matei Zaharia matei.zaha...@gmail.com wrote: You currently can't use SparkContext inside a Spark task, so in this case you'd have to call some kind of local K-means library. One example you can try to use is Weka (http

Re: How to kill running spark yarn application

2014-07-14 Thread Jerry Lam
Then yarn application -kill appid should work. This is what I did 2 hours ago. Sorry I cannot provide more help. Sent from my iPhone On 14 Jul, 2014, at 6:05 pm, hsy...@gmail.com hsy...@gmail.com wrote: yarn-cluster On Mon, Jul 14, 2014 at 2:44 PM, Jerry Lam chiling...@gmail.com wrote

Re: Need help on spark Hbase

2014-07-15 Thread Jerry Lam
Hi Rajesh, can you describe your spark cluster setup? I saw localhost:2181 for zookeeper. Best Regards, Jerry On Tue, Jul 15, 2014 at 9:47 AM, Madabhattula Rajesh Kumar mrajaf...@gmail.com wrote: Hi Team, Could you please help me to resolve the issue. *Issue *: I'm not able to connect

Re: Need help on spark Hbase

2014-07-15 Thread Jerry Lam
. uber jar it and run it just like any other simple java program. If you still have connection issues, then at least you know the problem is from the configurations. HTH, Jerry On Tue, Jul 15, 2014 at 12:10 PM, Krishna Sankar ksanka...@gmail.com wrote: One vector to check is the HBase libraries

Re: Repeated data item search with Spark SQL(1.0.1)

2014-07-15 Thread Jerry Lam
://issues.apache.org/jira/browse/SPARK-2483 seems to address only HiveQL. Best Regards, Jerry On Tue, Jul 15, 2014 at 3:38 AM, anyweil wei...@gmail.com wrote: Thank you so much for the information, now i have merge the fix of #1411 and seems the HiveSQL works with: SELECT name FROM people WHERE

Re: Need help on spark Hbase

2014-07-16 Thread Jerry Lam
. --jars A.jar,B.jar,C.jar not --jars A.jar, B.jar, C.jar I'm just guessing because when I used --jars I never have spaces in it. HTH, Jerry On Wed, Jul 16, 2014 at 5:30 AM, Madabhattula Rajesh Kumar mrajaf...@gmail.com wrote: Hi Team, Now i've changed my code and reading configuration from

Spark SQL UDF returning a list?

2014-12-03 Thread Jerry Raj
java.lang.RuntimeException: [1.57] failure: ``('' expected but identifier myudf found I also tried returning a List of Ints; that did not work either. Is there a way to write a UDF that returns a list? Thanks -Jerry
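
For reference, a hedged sketch of how an array-returning UDF can be registered on a later release (Spark 1.3+ syntax; the function and table names are made up). A Scala function returning a Seq maps to an ArrayType column:

    sqlContext.udf.register("splitCsv", (s: String) => s.split(",").toSeq)
    sqlContext.sql("SELECT splitCsv(tags) AS tag_list FROM t1").show()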

Spark SQL with a sorted file

2014-12-03 Thread Jerry Raj
Hi, If I create a SchemaRDD from a file that I know is sorted on a certain field, is it possible to somehow pass that information on to Spark SQL so that SQL queries referencing that field are optimized? Thanks -Jerry

Filtering nested data using Spark SQL

2014-12-10 Thread Jerry Lam
with name = apple with early stopping. Is this possible? If yes, how does one implement the contain function? Best Regards, Jerry

Accessing rows of a row in Spark

2014-12-15 Thread Jerry Lam
in which I can do that. The farthest I can get to is to convert items.toSeq. The type information I got back is: scala items.toSeq res57: Seq[Any] = [WrappedArray([1,orange],[2,apple])] Any suggestion? Best Regards, Jerry

Re: Accessing rows of a row in Spark

2014-12-15 Thread Jerry Lam
Hi Mark, Thank you for helping out. The items I got back from Spark SQL has the type information as follows: scala items res16: org.apache.spark.sql.Row = [WrappedArray([1,orange],[2,apple])] I tried to iterate the items as you suggested but no luck. Best Regards, Jerry On Mon, Dec 15

Spark SQL DSL for joins?

2014-12-16 Thread Jerry Raj
.user_id == t2.user_id) nor t1.join(t2, on = Some('t1.user_id == t2.user_id)) work, or even compile. I could not find any examples of how to perform a join using the DSL. Any pointers will be appreciated :) Thanks -Jerry
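
For comparison, the same join expressed in the later DataFrame API (Spark 1.3+), which replaced the SchemaRDD DSL; this is only a sketch with the column name taken from the thread:

    // t1 and t2 are DataFrames
    val joined = t1.join(t2, t1("user_id") === t2("user_id"))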

Re: Spark SQL DSL for joins?

2014-12-16 Thread Jerry Raj
Another problem with the DSL: t1.where('term == dmin).count() returns zero. But sqlCtx.sql(select * from t1 where term = 'dmin').count() returns 700, which I know is correct from the data. Is there something wrong with how I'm using the DSL? Thanks On 17/12/14 11:13 am, Jerry Raj wrote
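
A likely explanation, shown here as a sketch with the DataFrame API: Scala's == compares the column reference with the string and yields a plain Boolean (false), which filters out every row, whereas the DSL comparison operator is ===, which builds an equality expression:

    t1.filter(t1("term") === "dmin").count()   // === builds the comparison expression; == does not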

Spark SQL 1.1.1 reading LZO compressed json files

2014-12-17 Thread Jerry Lam
Hi spark users, Do you know how to read json files using Spark SQL that are LZO compressed? I'm looking into sqlContext.jsonFile but I don't know how to configure it to read lzo files. Best Regards, Jerry

Re: Spark SQL 1.1.1 reading LZO compressed json files

2014-12-17 Thread Jerry Lam
Hi Ted, Thanks for your help. I'm able to read lzo files using sparkContext.newAPIHadoopFile but I couldn't do the same for sqlContext because sqlContext.jsonFile does not provide ways to configure the input file format. Do you know if there are some APIs to do that? Best Regards, Jerry On Wed

Re: Spark SQL 1.1.1 reading LZO compressed json files

2014-12-17 Thread Jerry Lam
) In some scenarios, Hadoop is faster because it is saving one stage. Did I do something wrong? Best Regards, Jerry On Wed, Dec 17, 2014 at 1:29 PM, Michael Armbrust mich...@databricks.com wrote: You can create an RDD[String] using whatever method and pass that to jsonRDD. On Wed, Dec 17, 2014
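
A sketch of the RDD[String] plus jsonRDD approach described above, assuming the hadoop-lzo input format is on the classpath (the class name and the path are assumptions):

    import com.hadoop.mapreduce.LzoTextInputFormat   // from hadoop-lzo (assumed)
    import org.apache.hadoop.io.{LongWritable, Text}

    val lines = sc.newAPIHadoopFile(
        "hdfs:///data/events.json.lzo",              // hypothetical input path
        classOf[LzoTextInputFormat],
        classOf[LongWritable],
        classOf[Text]
      ).map(_._2.toString)

    val events = sqlContext.jsonRDD(lines)           // jsonRDD accepts any RDD[String]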

UNION two RDDs

2014-12-18 Thread Jerry Lam
Hi Spark users, I wonder if val resultRDD = RDDA.union(RDDB) will always have records in RDDA before records in RDDB. Also, will resultRDD.coalesce(1) change this ordering? Best Regards, Jerry

Re: UNION two RDDs

2014-12-22 Thread Jerry Lam
Hi Sean and Madhu, Thank you for the explanation. I really appreciate it. Best Regards, Jerry On Fri, Dec 19, 2014 at 4:50 AM, Sean Owen so...@cloudera.com wrote: coalesce actually changes the number of partitions. Unless the original RDD had just 1 partition, coalesce(1) will make an RDD

Re: Spark SQL with a sorted file

2014-12-22 Thread Jerry Raj
Michael, Thanks. Is this still turned off in the released 1.2? Is it possible to turn it on just to get an idea of how much of a difference it makes? -Jerry On 05/12/14 12:40 am, Michael Armbrust wrote: I'll add that some of our data formats will actual infer this sort of useful information

SparkSQL: CREATE EXTERNAL TABLE with a SchemaRDD

2014-12-23 Thread Jerry Lam
) at org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd(HiveContext.scala:382) Is this supported? Best Regards, Jerry

Spark or Tachyon: capture data lineage

2015-01-02 Thread Jerry Lam
. Is this something already possible with spark/tachyon? If not, do you think it is possible? Does anyone mind sharing their experience in capturing the data lineage in a data processing pipeline? Best Regards, Jerry

Re: Reading from CSV file with spark-csv_2.10

2015-02-05 Thread Jerry Lam
. However, I didn't use the spark-csv package; I did that manually, so I cannot comment on spark-csv. HTH, Jerry On Thu, Feb 5, 2015 at 9:32 AM, Spico Florin spicoflo...@gmail.com wrote: Hello! I'm using spark-csv 2.10 with Java from the maven repository groupIdcom.databricks/groupId

Re: Why Parquet Predicate Pushdown doesn't work?

2015-01-19 Thread Jerry Lam
Hi guys, Does this issue affect 1.2.0 only or all previous releases as well? Best Regards, Jerry On Thu, Jan 8, 2015 at 1:40 AM, Xuelin Cao xuelincao2...@gmail.com wrote: Yes, the problem is, I've turned the flag on. One possible reason for this is, the parquet file supports predicate

Re: Spark Team - Paco Nathan said that your team can help

2015-01-22 Thread Jerry Lam
not affiliated with Cloudera, but it seems they are the only ones who are very active in the spark project and provide a hadoop distribution. HTH, Jerry btw, who is Paco Nathan? On Thu, Jan 22, 2015 at 10:03 AM, Babu, Prashanth prashanth.b...@nttdata.com wrote: Sudipta, Use the Docker image

Re: Union in Spark

2015-02-01 Thread Jerry Lam
Hi Deep, what do you mean by stuck? Jerry On Mon, Feb 2, 2015 at 12:44 AM, Deep Pradhan pradhandeep1...@gmail.com wrote: Hi, Is there any better operation than Union. I am using union and the cluster is getting stuck with a large data set. Thank you

Re: Union in Spark

2015-02-01 Thread Jerry Lam
Hi Deep, How do you know the cluster is not responsive because of Union? Did you check the spark web console? Best Regards, Jerry On Mon, Feb 2, 2015 at 1:21 AM, Deep Pradhan pradhandeep1...@gmail.com wrote: The cluster hangs. On Mon, Feb 2, 2015 at 11:25 AM, Jerry Lam chiling

Re: Spark SQL: Storing AVRO Schema in Parquet

2015-01-09 Thread Jerry Lam
objects. I'm thinking of overriding the saveAsParquetFile method to allow me to persist the avro schema inside parquet. Is this possible at all? Best Regards, Jerry On Fri, Jan 9, 2015 at 2:05 AM, Raghavendra Pandey raghavendra.pan...@gmail.com wrote: I came across this http://zenfractal.com

Re: IndexedRDD

2015-01-13 Thread Jerry Lam
wasn't that bad at all. If it is not indexed, I expect it to take a much longer time. Can IndexedRDD be sorted by keys as well? Best Regards, Jerry On Tue, Jan 13, 2015 at 11:06 AM, Andrew Ash and...@andrewash.com wrote: Hi Jem, Linear time in scaling on the big table doesn't seem

Spark SQL: Storing AVRO Schema in Parquet

2015-01-08 Thread Jerry Lam
that. Is there another API that allows me to do this? Best Regards, Jerry

Re: Benchmark results between Flink and Spark

2015-07-05 Thread Jerry Lam
is in comparison to Flink is one of the immediate questions I have. It would be great if they had the benchmark software available somewhere for other people to experiment with. Just my 2 cents, Jerry On Sun, Jul 5, 2015 at 4:35 PM, Ted Yu yuzhih...@gmail.com wrote: There was no mentioning

Re: Spark + Jupyter (IPython Notebook)

2015-08-18 Thread Jerry Lam
Hi Guru, Thanks! Great to hear that someone tried it in production. How do you like it so far? Best Regards, Jerry On Tue, Aug 18, 2015 at 11:38 AM, Guru Medasani gdm...@gmail.com wrote: Hi Jerry, Yes. I’ve seen customers using this in production for data science work. I’m currently

Spark return key value pair

2015-08-19 Thread Jerry OELoo
Hi. I want to parse a file and return a key-value pair with pySpark, but the result is strange to me. The test.sql is a big file and each line is a username and password with # between them. I use the mapper2 below to map the data, and in my understanding, i in words.take(10) should be a tuple, but the result is

Re: Spark + Jupyter (IPython Notebook)

2015-08-18 Thread Jerry Lam
Hi Prabeesh, That's even better! Thanks for sharing Jerry On Tue, Aug 18, 2015 at 1:31 PM, Prabeesh K. prabsma...@gmail.com wrote: Refer this post http://blog.prabeeshk.com/blog/2015/06/19/pyspark-notebook-with-docker/ Spark + Jupyter + Docker On 18 August 2015 at 21:29, Jerry Lam

Spark + Jupyter (IPython Notebook)

2015-08-18 Thread Jerry Lam
cannot do this. Other solutions (e.g. Zeppelin) seem to reinvent the wheel that IPython already offered years ago. It would be great if someone could educate me on the reason behind this. Best Regards, Jerry

Re: [survey] [spark-ec2] What do you like/dislike about spark-ec2?

2015-08-17 Thread Jerry Lam
into server: /etc/httpd/modules/mod_authz_core.so: cannot open shared object file: No such file or directory [FAILED] Best Regards, Jerry On Mon, Aug 17, 2015 at 11:09 AM, Nicholas Chammas nicholas.cham...@gmail.com wrote: Howdy folks! I’m interested in hearing about what people think of spark-ec2

Spark Master Build Git Commit Hash

2015-07-30 Thread Jerry Lam
, Jerry

Re: Spark Master Build Git Commit Hash

2015-07-30 Thread Jerry Lam
on. Thank you for your help! Jerry On Thu, Jul 30, 2015 at 11:10 AM, Ted Yu yuzhih...@gmail.com wrote: The files were dated 16-Jul-2015 Looks like nightly build either was not published, or published at a different location. You can download spark-1.5.0-SNAPSHOT.tgz and binary-search

Re: Controlling number of executors on Mesos vs YARN

2015-08-11 Thread Jerry Lam
My experience with Mesos + Spark is not great. I saw one executor with 30 CPU and the other executor with 6. So I don't think you can easily configure it without some tweaking at the source code. Sent from my iPad On 2015-08-11, at 2:38, Haripriya Ayyalasomayajula aharipriy...@gmail.com

Re: Parquet without hadoop: Possible?

2015-08-11 Thread Jerry Lam
Just out of curiosity, what is the advantage of using parquet without hadoop? Sent from my iPhone On 11 Aug, 2015, at 11:12 am, saif.a.ell...@wellsfargo.com wrote: I confirm that it works, I was just having this issue: https://issues.apache.org/jira/browse/SPARK-8450 Saif From:

Poor HDFS Data Locality on Spark-EC2

2015-08-04 Thread Jerry Lam
before? Best Regards, Jerry

Re: Accessing S3 files with s3n://

2015-08-09 Thread Jerry Lam
Hi Akshat, Is there a particular reason you don't use s3a? From my experience, s3a performs much better than the rest. I believe the inefficiency is from the implementation of the s3 interface. Best Regards, Jerry Sent from my iPhone On 9 Aug, 2015, at 5:48 am, Akhil Das ak

Re: Controlling number of executors on Mesos vs YARN

2015-08-12 Thread Jerry Lam
Great stuff Tim. This definitely will make Mesos users life easier Sent from my iPad On 2015-08-12, at 11:52, Haripriya Ayyalasomayajula aharipriy...@gmail.com wrote: Thanks Tim, Jerry. On Wed, Aug 12, 2015 at 1:18 AM, Tim Chen t...@mesosphere.io wrote: Yes the options

Unexpected performance issues with Spark SQL using Parquet

2015-07-27 Thread Jerry Lam
. The speed is 4x faster in the data-without-mapping case, which means that the more columns a parquet file has, the slower it is, even when only a specific column is needed. Does anyone have an explanation for this? I was expecting both of them to finish in approximately the same time. Best Regards, Jerry

Re: Parquet problems

2015-07-22 Thread Jerry Lam
Hi guys, I noticed that too. Anders, can you confirm that it works on Spark 1.5 snapshot? This is what I tried at the end. It seems it is 1.4 issue. Best Regards, Jerry On Wed, Jul 22, 2015 at 11:46 AM, Anders Arpteg arp...@spotify.com wrote: No, never really resolved the problem, except

Re: Benchmark results between Flink and Spark

2015-07-14 Thread Jerry Lam
similar style off-heap memory mgmt, more planning optimizations *From:* Jerry Lam [mailto:chiling...@gmail.com chiling...@gmail.com] *Sent:* Sunday, July 5, 2015 6:28 PM *To:* Ted Yu *Cc:* Slim Baltagi; user *Subject:* Re: Benchmark results between Flink and Spark Hi guys, I just read

Re: Counting distinct values for a key?

2015-07-19 Thread Jerry Lam
You mean this does not work? SELECT key, count(value) from table group by key On Sun, Jul 19, 2015 at 2:28 PM, N B nb.nos...@gmail.com wrote: Hello, How do I go about performing the equivalent of the following SQL clause in Spark Streaming? I will be using this on a Windowed DStream.

Re: Spark Mesos Dispatcher

2015-07-19 Thread Jerry Lam
Yes. Sent from my iPhone On 19 Jul, 2015, at 10:52 pm, Jahagirdar, Madhu madhu.jahagir...@philips.com wrote: All, Can we run different version of Spark using the same Mesos Dispatcher. For example we can run drivers with Spark 1.3 and Spark 1.4 at the same time ? Regards, Madhu

Re: Counting distinct values for a key?

2015-07-19 Thread Jerry Lam
Hi Nikunj, Sorry, I totally misread your question. I think you need to first groupByKey (get all the values of the same key together), then follow with mapValues (probably put the values into a set and then take its size, because you want a distinct count). HTH, Jerry Sent from my iPhone
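
A small sketch of both approaches (the pair RDD is hypothetical):

    // pairs: RDD[(String, String)]
    // following the groupByKey suggestion above
    val distinctPerKey  = pairs.groupByKey().mapValues(_.toSet.size)
    // or avoid materializing all values of a key at once
    val distinctPerKey2 = pairs.distinct().mapValues(_ => 1L).reduceByKey(_ + _)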

Re: Spark Mesos Dispatcher

2015-07-19 Thread Jerry Lam
? -- *From:* Jerry Lam [chiling...@gmail.com] *Sent:* Monday, July 20, 2015 8:27 AM *To:* Jahagirdar, Madhu *Cc:* user; d...@spark.apache.org *Subject:* Re: Spark Mesos Dispatcher Yes. Sent from my iPhone On 19 Jul, 2015, at 10:52 pm, Jahagirdar, Madhu madhu.jahagir...@philips.com wrote

Re: [Spark SQL]: Spark Job Hangs on the refresh method when saving over 1 million files

2015-10-25 Thread Jerry Lam
memory which is a bit odd in my opinion. Any help will be greatly appreciated. Best Regards, Jerry On Sun, Oct 25, 2015 at 9:25 PM, Josh Rosen <rosenvi...@gmail.com> wrote: > Hi Jerry, > > Do you have speculation enabled? A write which produces one million files > / output pa

Re: [Spark SQL]: Spark Job Hangs on the refresh method when saving over 1 million files

2015-10-25 Thread Jerry Lam
) org.apache.spark.sql.execution.datasources.LogicalRelation.(LogicalRelation.scala:31) org.apache.spark.sql.SQLContext.baseRelationToDataFrame(SQLContext.scala:395) org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:267) On Sun, Oct 25, 2015 at 10:25 PM, Jerry Lam <chiling...@gmail.com> wrote: > Hi Josh, > >

Re: [Spark SQL]: Spark Job Hangs on the refresh method when saving over 1 million files

2015-10-25 Thread Jerry Lam
parameters to make it more memory efficient? Best Regards, Jerry On Sun, Oct 25, 2015 at 8:39 PM, Jerry Lam <chiling...@gmail.com> wrote: > Hi guys, > > After waiting for a day, it actually causes OOM on the spark driver. I > configure the driver to have 6GB. Note that I didn't c

Re: [Spark SQL]: Spark Job Hangs on the refresh method when saving over 1 million files

2015-10-25 Thread Jerry Lam
million files. Not sure why it OOMs the driver after the job is marked _SUCCESS in the output folder. Best Regards, Jerry On Sat, Oct 24, 2015 at 9:35 PM, Jerry Lam <chiling...@gmail.com> wrote: > Hi Spark users and developers, > > Does anyone encounter any issue when a spark SQL job

Exception in thread "main" java.lang.IllegalArgumentException: Positive number of slices required

2015-10-29 Thread Jerry Wong
I used Spark 1.3.1 to populate the event logs to Cassandra. But there is an exception for which I could not find any clues. Can anybody give me any help? Exception in thread "main" java.lang.IllegalArgumentException: Positive number of slices required at
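
For context, this message is typically thrown by sc.parallelize (ParallelCollectionRDD) when the requested number of partitions is zero or negative, for example when the slice count is computed from the input; a hypothetical sketch (events and batchSize are made-up names):

    // a slice count derived from the input size can silently become 0
    val slices = math.max(1, events.size / batchSize)   // guard against asking for zero slices
    val rdd = sc.parallelize(events, slices)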

[Spark-SQL]: Unable to propagate hadoop configuration after SparkContext is initialized

2015-10-27 Thread Jerry Lam
.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Any idea why it can read the schema from the parquet file but not process the file? It feels like the hadoop configuration is not sent to the executors for some reason... Thanks, Jerry

Re: [Spark-SQL]: Unable to propagate hadoop configuration after SparkContext is initialized

2015-10-27 Thread Jerry Lam
load the parquet file but I cannot perform a count on the parquet file because of the AmazonClientException. It means that the credentials are used during the loading of the parquet but not when we are processing the parquet file. How can this happen? Best Regards, Jerry On Tue, Oct 27, 2015 at 2:05 PM,

Re: [Spark-SQL]: Unable to propagate hadoop configuration after SparkContext is initialized

2015-10-27 Thread Jerry Lam
("key", "value") does not propagate through all SQL jobs within the same SparkContext? I haven't tried with Spark Core so I cannot tell. Is there a workaround, given it seems to be broken? I need to do this programmatically after the SparkContext is instantiated, not before... Best Regards, J
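
For reference, the kind of call being discussed — setting S3 credentials on the Hadoop configuration after the context has been created (the s3a property names are an assumption about this particular setup):

    // set programmatically after the SparkContext/SQLContext are instantiated
    sc.hadoopConfiguration.set("fs.s3a.access.key", accessKey)
    sc.hadoopConfiguration.set("fs.s3a.secret.key", secretKey)
    // the thread reports that reading the schema succeeds but the executor-side
    // scan still fails, i.e. the setting does not reach the executors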

Re: Spark -- Writing to Partitioned Persistent Table

2015-10-28 Thread Jerry Lam
Hi Bryan, Did you read the email I sent few days ago. There are more issues with partitionBy down the road: https://www.mail-archive.com/user@spark.apache.org/msg39512.html <https://www.mail-archive.com/user@spark.apache.org/msg39512.html> Best Regards, Jerry > On Oct 28, 2015, a

Re: [Spark SQL]: Spark Job Hangs on the refresh method when saving over 1 million files

2015-10-26 Thread Jerry Lam
of partitions is over 100. Best Regards, Jerry Sent from my iPhone > On 26 Oct, 2015, at 2:50 am, Fengdong Yu <fengdo...@everstring.com> wrote: > > How many partitions did you generate? > If millions were generated, then there is huge memory consumption.

Spark SQL: Issues with using DirectParquetOutputCommitter with APPEND mode and OVERWRITE mode

2015-10-22 Thread Jerry Lam
? Best Regards, Jerry

Re: Spark -- Writing to Partitioned Persistent Table

2015-10-28 Thread Jerry Lam
. It takes a while to initialize the partition table and it requires a lot of memory from the driver. I would not use it if the number of partitions goes over a few hundred. Hope this helps, Jerry Sent from my iPhone > On 28 Oct, 2015, at 6:33 pm, Bryan <bryan.jeff...@gmail.com> wrote:

Re: Very slow startup for jobs containing millions of tasks

2015-11-14 Thread Jerry Lam
Hi Ted, That looks exactly like what happens. It has been 5 hrs now. The code was built for 1.4. Thank you very much! Best Regards, Jerry Sent from my iPhone > On 14 Nov, 2015, at 11:21 pm, Ted Yu <yuzhih...@gmail.com> wrote: > > Which release are you using ? > If older th

Re: spark sql partitioned by date... read last date

2015-11-01 Thread Jerry Lam
r. the max-date is likely > to be faster though. > > On Sun, Nov 1, 2015 at 4:36 PM, Jerry Lam <chiling...@gmail.com> wrote: > >> Hi Koert, >> >> You should be able to see if it requires scanning the whole data by >> "explain" the query. The physica

Re: spark sql partitioned by date... read last date

2015-11-01 Thread Jerry Lam
Hi Koert, You should be able to see if it requires scanning the whole data by "explain" the query. The physical plan should say something about it. I wonder if you are trying the distinct-sort-by-limit approach or the max-date approach? Best Regards, Jerry On Sun, Nov 1, 2015
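
A sketch of checking this with explain(), assuming a Spark 1.4/1.5-style read over the date-partitioned layout described later in the thread:

    import org.apache.spark.sql.functions.max

    val df = sqlContext.read.parquet("data")   // layout: data/date=2015-01-01/...
    val lastDate = df.select(max("date"))
    lastDate.explain()                          // shows whether the whole data set must be scanned
    lastDate.show()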

Re: spark sql partitioned by date... read last date

2015-11-01 Thread Jerry Lam
of the physical plan, you can navigate the actual execution in the web UI to see how much data is actually read to satisfy this request. I hope it only requires a few bytes for few dates. Best Regards, Jerry On Sun, Nov 1, 2015 at 5:56 PM, Jerry Lam <chiling...@gmail.com> wrote: > I agreed the

Re: spark sql partitioned by date... read last date

2015-11-01 Thread Jerry Lam
s actually works or not. :) Best Regards, Jerry On Sun, Nov 1, 2015 at 3:03 PM, Koert Kuipers <ko...@tresata.com> wrote: > hello all, > i am trying to get familiar with spark sql partitioning support. > > my data is partitioned by date, so like this: > data/date=2015-01-01 >

Re: [Spark-SQL]: Disable HiveContext from instantiating in spark-shell

2015-11-06 Thread Jerry Lam
spark.sql.hive.enabled false configuration would be lovely too. :) Just an additional bonus is that it requires less memory if we don’t use HiveContext on the driver side (~100-200MB) from a rough observation. Thanks and have a nice weekend! Jerry > On Nov 6, 2015, at 5:53 PM, Ted Yu <yuzhih...@gma

Re: [Spark-SQL]: Disable HiveContext from instantiating in spark-shell

2015-11-06 Thread Jerry Lam
. /home/jerry directory). It will give me an exception like below. Since I don’t use HiveContext, I don’t see the need to maintain a database. What is interesting is that pyspark shell is able to start more than 1 session at the same time. I wonder what pyspark has done better than spark-shell

Re: Please reply if you use Mesos fine grained mode

2015-11-03 Thread Jerry Lam
We "used" Spark on Mesos to build an interactive data analysis platform because the interactive session could be long and might not use Spark for the entire session. It is very wasteful of resources to use the coarse-grained mode because it keeps resources for the entire session. Therefore,

Re: Spark EC2 script on Large clusters

2015-11-05 Thread Jerry Lam
Does Qubole use Yarn or Mesos for resource management? Sent from my iPhone > On 5 Nov, 2015, at 9:02 pm, Sabarish Sasidharan > wrote: > > Qubole

Re: [Spark-SQL]: Disable HiveContext from instantiating in spark-shell

2015-11-06 Thread Jerry Lam
) at org.apache.derby.jdbc.Driver20.connect(Unknown Source) at org.apache.derby.jdbc.AutoloadedDriver.connect(Unknown Source) at java.sql.DriverManager.getConnection(DriverManager.java:571) Best Regards, Jerry > On Nov 6, 2015, at 12:12 PM, Ted Yu <yuzhih...@gmail.com>

[Spark-SQL]: Disable HiveContext from instantiating in spark-shell

2015-11-06 Thread Jerry Lam
ply$mcZ$sp(SparkILoopExt.scala:127) at org.apache.spark.repl.SparkILoopExt$$anonfun$process$1.apply(SparkILoopExt.scala:113) at org.apache.spark.repl.SparkILoopExt$$anonfun$process$1.apply(SparkILoopExt.scala:113) Best Regards, Jerry

Re: [Spark-SQL]: Disable HiveContext from instantiating in spark-shell

2015-11-06 Thread Jerry Lam
config of skipping the above call. > > FYI > > On Fri, Nov 6, 2015 at 8:53 AM, Jerry Lam <chiling...@gmail.com> wrote: > Hi spark users and developers, > > Is it possible to disable HiveContext from being instantiated when usin

Re: Indexing Support

2015-10-18 Thread Jerry Lam
I'm interested in it, but I doubt there will be r-tree indexing support in the near future as spark is not a database. You might have better luck looking at databases with spatial indexing support out of the box. Cheers Sent from my iPad On 2015-10-18, at 17:16, Mustafa Elbehery
