Re: Spark sql not pushing down timestamp range queries

2016-04-14 Thread Takeshi Yamamuro
Hi Mich, Did you check the URL Josh referred to? The cast for string comparisons is needed for accepting `c_date >= "2016"`. // maropu On Fri, Apr 15, 2016 at 10:30 AM, Hyukjin Kwon wrote: > Hi, > > > String comparison itself is pushed down fine but the problem is to

Re: Strange bug: Filter problem with parenthesis

2016-04-14 Thread Takeshi Yamamuro
Hi, It seems you cannot use reserved words (e.g., sum and avg) in the Spark SQL parser, because an input string passed to filter is processed by that parser internally. // maropu On Thu, Apr 14, 2016 at 11:14 PM, wrote: > Appreciated Michael, but this doesn’t help my case, the

Re: YARN vs Standalone Spark Usage in production

2016-04-14 Thread Takeshi Yamamuro
Hi, How about checking Spark survey result 2015 in https://databricks.com/blog/2015/09/24/spark-survey-results-2015-are-now-available.html for the statistics? // maropu On Fri, Apr 15, 2016 at 4:52 AM, Mark Hamstra wrote: > That's also available in standalone. > > On

Re: When did Spark start supporting ORC and Parquet?

2016-04-14 Thread Takeshi Yamamuro
Hi, See SPARK-2883 for ORC support. // maropu On Fri, Apr 15, 2016 at 11:22 AM, Ted Yu wrote: > For Parquet, please take a look at SPARK-1251 > > For ORC, not sure. > Looking at git history, I found ORC mentioned by

Re: When did Spark start supporting ORC and Parquet?

2016-04-14 Thread Ted Yu
For Parquet, please take a look at SPARK-1251 For ORC, not sure. Looking at git history, I found ORC mentioned by SPARK-1368 FYI On Thu, Apr 14, 2016 at 6:53 PM, Edmon Begoli wrote: > I am needing this fact for the research paper I am writing right now. > > When did Spark

decline offer timeout

2016-04-14 Thread Rodrick Brown
I have hundreds of small spark jobs running on my Mesos cluster, causing starvation to other frameworks like Marathon. Is there a way to prevent these frameworks from getting offers so often? Apr 15 02:00:12 prod-mesos-m-3.$SERVER.com mesos-master[10259]: I0415

When did Spark start supporting ORC and Parquet?

2016-04-14 Thread Edmon Begoli
I am needing this fact for the research paper I am writing right now. When did Spark start supporting Parquet and when ORC? (what release) I appreciate any info you can offer. Thank you, Edmon

Re: Spark sql not pushing down timestamp range queries

2016-04-14 Thread Hyukjin Kwon
Hi, String comparison itself is pushed down fine but the problem is to deal with Cast. It was pushed down before but it was reverted ( https://github.com/apache/spark/pull/8049). Several fixes were tried here, https://github.com/apache/spark/pull/11005 and others, but there were no changes to

Re: spark-ec2 hitting yum install issues

2016-04-14 Thread Nicholas Chammas
If you log into the cluster and manually try that step, does it still fail? Can you yum install anything else? You might want to report this issue directly on the spark-ec2 repo, btw: https://github.com/amplab/spark-ec2 Nick On Thu, Apr 14, 2016 at 9:08 PM sanusha
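A hedged debugging sequence for the manual check Nick suggests (the hostname is a placeholder; spark-ec2 AMIs are set up for root login):

```shell
ssh root@<master-hostname>
yum clean all          # drop possibly-stale repo metadata
yum install -y pssh    # retry the failing step by hand
yum install -y tree    # try an unrelated package to see if the repo itself is broken
```

If unrelated packages also fail, the problem is the AMI's yum repos rather than spark-ec2's setup.sh.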

spark-ec2 hitting yum install issues

2016-04-14 Thread sanusha
I am using spark-1.6.1-prebuilt-with-hadoop-2.6 on mac. I am using the spark-ec2 to launch a cluster in Amazon VPC. The setup.sh script [run first thing on master after launch] uses pssh and tries to install it via 'yum install -y pssh'. This step always fails on the master AMI that the script

Re: Spark/Parquet

2016-04-14 Thread Hyukjin Kwon
Currently Spark uses Parquet 1.7.0 (parquet-mr). If you meant writer version2 (parquet-format), you can specify this by manually setting as below: sparkContext.hadoopConfiguration.set(ParquetOutputFormat.WRITER_VERSION, ParquetProperties.WriterVersion.PARQUET_2_0.toString) 2016-04-15 2:21

Re: how to write pyspark interface to scala code?

2016-04-14 Thread Holden Karau
It's a bit tricky - if the user's data is represented in a DataFrame or Dataset then it's much easier. Assuming that the function is going to be called from the driver program (e.g. not inside of a transformation or action) then you can use the Py4J context to make the calls. You might find looking

Re: error "Py4JJavaError: An error occurred while calling z:org.apache.spark.sql.execution.EvaluatePython.takeAndServe."

2016-04-14 Thread Holden Karau
The org.apache.spark.sql.execution.EvaluatePython.takeAndServe exception can happen in a lot of places; it might be easier to figure out if you have a code snippet you can share where this is occurring? On Wed, Apr 13, 2016 at 2:27 PM, AlexModestov wrote: > I get

Re: JSON Usage

2016-04-14 Thread Holden Karau
You could certainly use RDDs for that; you might also find it easier to use a Dataset, selecting the fields you need to construct the URL to fetch, and then using the map function. On Thu, Apr 14, 2016 at 12:01 PM, Benjamin Kim wrote: > I was wonder what would be the best way
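A plain-Python sketch of Holden's suggestion, with no Spark required: the record shape and URL format below are invented for illustration, and `to_url` is the kind of function you would pass to `map` over the RDD/Dataset.

```python
import json

# Hypothetical JSON lines standing in for the collection of records.
records = [
    '{"host": "example.com", "path": "/data/1.csv"}',
    '{"host": "example.com", "path": "/data/2.csv"}',
]

def to_url(line):
    """Extract the fields needed to form the download URL."""
    rec = json.loads(line)
    return "https://{0}{1}".format(rec["host"], rec["path"])

# In Spark this would be: sc.parallelize(records).map(to_url)
urls = [to_url(r) for r in records]
```

The actual download step would then live in a second `map` (or `mapPartitions`, to reuse one HTTP connection per partition).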

Re: Error with --files

2016-04-14 Thread Benjamin Zaitlen
That fixed it! Thank you! --Ben On Thu, Apr 14, 2016 at 5:53 PM, Marcelo Vanzin wrote: > On Thu, Apr 14, 2016 at 2:14 PM, Benjamin Zaitlen > wrote: > >> spark-submit --master yarn-cluster /home/ubuntu/test_spark.py --files > >>

Re: How does spark-submit handle Python scripts (and how to repeat it)?

2016-04-14 Thread Andrei
Yes, I tried setting YARN_CONF_DIR, but with no luck. I will play around with environment variables and system properties and post back in case of success. Thanks for your help so far! On Thu, Apr 14, 2016 at 5:48 AM, Sun, Rui wrote: > In SparkSubmit, there is less work for

Re: Error with --files

2016-04-14 Thread Marcelo Vanzin
On Thu, Apr 14, 2016 at 2:14 PM, Benjamin Zaitlen wrote: >> spark-submit --master yarn-cluster /home/ubuntu/test_spark.py --files >> /home/ubuntu/localtest.txt#appSees.txt --files should come before the path to your python script. Otherwise it's just passed as arguments to
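Putting Marcelo's correction together, the invocation from the thread becomes (same paths as in the original mail; the `#appSees.txt` suffix is the name the file gets inside the containers):

```shell
spark-submit --master yarn-cluster \
  --files /home/ubuntu/localtest.txt#appSees.txt \
  /home/ubuntu/test_spark.py
```

Anything after the application file is passed to the script as `sys.argv`, which is why the original ordering silently did nothing.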

Re: Error with --files

2016-04-14 Thread Ted Yu
bq. localtest.txt#appSees.txt Which file did you want to pass ? Thanks On Thu, Apr 14, 2016 at 2:14 PM, Benjamin Zaitlen wrote: > Hi All, > > I'm trying to use the --files option with yarn: > > spark-submit --master yarn-cluster /home/ubuntu/test_spark.py --files >>

Adding metadata information to parquet files

2016-04-14 Thread Manivannan Selvadurai
Hi All, I'm trying to ingest data from Kafka as parquet files. I use spark 1.5.2 and I'm looking for a way to store the source schema in the parquet file, like the way you get to store the avro schema as metadata info when using the AvroParquetWriter. Any help much appreciated.

Can this performance be improved?

2016-04-14 Thread Bibudh Lahiri
Hi, As part of a larger program, I am extracting the distinct values of some columns of an RDD with 100 million records and 4 columns. I am running Spark in standalone cluster mode on my laptop (2.3 GHz Intel Core i7, 10 GB 1333 MHz DDR3 RAM) with all the 8 cores given to a single worker. So
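One idea for this workload, sketched in plain Python (toy rows stand in for the 100-million-record RDD): collect the distinct values of all four columns in a single pass instead of running one `distinct()` job per column. In Spark this would correspond to a single `aggregate` over the RDD; this is a sketch of the approach, not the poster's code.

```python
# Toy 4-column rows; in Spark each tuple would be one RDD record.
rows = [(1, "a", True, 0.5), (2, "a", False, 0.5), (1, "b", True, 0.5)]

NCOLS = 4
distinct = [set() for _ in range(NCOLS)]  # one accumulator set per column
for row in rows:                          # single pass over the data
    for i, value in enumerate(row):
        distinct[i].add(value)
```

This trades memory (the per-column sets) for scan count, so it only pays off when each column's cardinality is modest.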

Re: Spark replacing Hadoop

2016-04-14 Thread Mich Talebzadeh
One can see from the responses that the Big Data landscape is getting very crowded with tools, and there are dozens of alternatives offered. However, as usual, the laws of selection will gravitate towards solutions that are scalable, reliable and, more importantly, cost effective. To this end any

Error with --files

2016-04-14 Thread Benjamin Zaitlen
Hi All, I'm trying to use the --files option with yarn: spark-submit --master yarn-cluster /home/ubuntu/test_spark.py --files > /home/ubuntu/localtest.txt#appSees.txt I never see the file in HDFS or in the yarn containers. Am I doing something incorrect? I'm running spark 1.6.0 Thanks,

Re: Spark replacing Hadoop

2016-04-14 Thread Peyman Mohajerian
Cloud adds another dimension: the fact that in the cloud compute and storage are decoupled (s3-emr or blob-hdinsight) means that in the cloud Hadoop ends up being more of a compute engine, and a lot of the governance and security features are irrelevant or less important because data at rest is outside of Hadoop.

Re: Spark replacing Hadoop

2016-04-14 Thread Cody Koeninger
I've been using spark for years and have (thankfully) been able to avoid needing HDFS, aside from one contract where it was already in use. At this point, many of the people I know would consider Kafka to be more important than HDFS. On Thu, Apr 14, 2016 at 3:11 PM, Jörn Franke

Re: Spark replacing Hadoop

2016-04-14 Thread Jörn Franke
I do not think so. Hadoop provides an ecosystem in which you can deploy different engines, such as MR, HBase, TEZ, Spark, Flink, titandb, hive, solr... I observe also that commercial analytical tools use one or more of these engines to execute their code in a distributed fashion. You need this

Re: Spark replacing Hadoop

2016-04-14 Thread Sean Owen
Depends indeed on what you mean by "Hadoop". The core Hadoop project is MapReduce, YARN and HDFS. MapReduce is still in use as a workhorse but superseded by engines like Spark (or perhaps Flink). (Tez maps loosely to Spark Core really, and is not really a MapReduce replacement.) "Hadoop" can

Re: YARN vs Standalone Spark Usage in production

2016-04-14 Thread Mark Hamstra
That's also available in standalone. On Thu, Apr 14, 2016 at 12:47 PM, Alexander Pivovarov wrote: > Spark on Yarn supports dynamic resource allocation > > So, you can run several spark-shells / spark-submits / spark-jobserver / > zeppelin on one cluster without defining

Client process memory usage

2016-04-14 Thread Nisrina Luthfiyati
Hi all, I have a python Spark application that I'm running using spark-submit in yarn-cluster mode. If I run ps -aux | grep in the submitter node, I can find the client process that submitted the application, usually with around 300-600 MB memory use (%MEM around 1.0-2.0 in a node with 30 GB

Re: Spark replacing Hadoop

2016-04-14 Thread Arunkumar Chandrasekar
Hello, I would stand on the side of Spark. Spark provides numerous add-ons like Spark SQL and Spark MLlib that are hard to set up with MapReduce. Thank You. > On Apr 15, 2016, at 1:16 AM, Ashok Kumar wrote: > > Hello, > > Well, Sounds like

Re: YARN vs Standalone Spark Usage in production

2016-04-14 Thread Alexander Pivovarov
Spark on Yarn supports dynamic resource allocation So, you can run several spark-shells / spark-submits / spark-jobserver / zeppelin on one cluster without defining upfront how many executors / memory you want to allocate to each app Great feature for regular users who just want to run Spark /
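A minimal configuration sketch for this setup: dynamic allocation on YARN also needs the external shuffle service so executors can be released without losing shuffle output. In `spark-defaults.conf`:

```
spark.dynamicAllocation.enabled  true
spark.shuffle.service.enabled    true
```

Per-app bounds such as `spark.dynamicAllocation.minExecutors` / `spark.dynamicAllocation.maxExecutors` can then cap what each spark-shell or job server grabs from the cluster.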

Re: Spark replacing Hadoop

2016-04-14 Thread Ashok Kumar
Hello, Well, Sounds like Andy is implying that Spark can replace Hadoop whereas Mich still believes that HDFS is a keeper? thanks On Thursday, 14 April 2016, 20:40, David Newberger wrote:

Re: Spark replacing Hadoop

2016-04-14 Thread Felipe Gustavo
Hi Ashok, In my opinion, we should look at Hadoop as a general purpose framework that supports multiple models, and we should look at Spark as an alternative to Hadoop MapReduce rather than a replacement for the Hadoop ecosystem (for instance, Spark is not replacing ZooKeeper, HDFS, etc.) Regards On

RE: Spark replacing Hadoop

2016-04-14 Thread David Newberger
Can we assume your question is “Will Spark replace Hadoop MapReduce?” or do you literally mean replacing the whole of Hadoop? David From: Ashok Kumar [mailto:ashok34...@yahoo.com.INVALID] Sent: Thursday, April 14, 2016 2:13 PM To: User Subject: Spark replacing Hadoop Hi, I hear that some

Re: Spark replacing Hadoop

2016-04-14 Thread Mich Talebzadeh
Hi, My two cents here. Hadoop as I understand has two components namely HDFS (Hadoop Distributed File System) and MapReduce. Whatever we use I still think we need to store data on HDFS (excluding standalones like MongoDB etc.). Now moving to MapReduce as the execution engine that is replaced by

Re: Spark replacing Hadoop

2016-04-14 Thread Andy Davidson
Hi Ashok In general, if I was starting a new project and had not invested heavily in hadoop (i.e. had a large staff that was trained on hadoop, had a lot of existing projects implemented on hadoop, …) I would probably start using spark. It's faster and easier to use. Your mileage may vary Andy

Spark replacing Hadoop

2016-04-14 Thread Ashok Kumar
Hi, I hear some saying that Hadoop is getting old and out of date and will be replaced by Spark! Does this make sense and if so how accurate is it? Best

Re: YARN vs Standalone Spark Usage in production

2016-04-14 Thread Sean Owen
I don't think usage is the differentiating factor. YARN and standalone are pretty well supported. If you are only running a Spark cluster by itself with nothing else, standalone is probably simpler than setting up YARN just for Spark. However if you're running on a cluster that will host other

JSON Usage

2016-04-14 Thread Benjamin Kim
I was wondering what would be the best way to use JSON in Spark/Scala. I need to look up values of fields in a collection of records to form a URL and download the file at that location. I was thinking an RDD would be perfect for this. I just want to hear from others who might have more experience

Re: Spark sql not pushing down timestamp range queries

2016-04-14 Thread Mich Talebzadeh
Hi Josh, Can you please clarify whether date comparisons as two strings work at all? I was under the impression that with string comparison only the first characters are compared? Thanks Dr Mich Talebzadeh
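To the question above: string comparison is indeed character by character, but for fixed-width ISO-8601 date strings that lexicographic order happens to coincide with chronological order, which is why a string-level pushdown can still be correct. A plain-Python check:

```python
# Characters are compared left to right; because ISO-8601 puts
# year, then month, then day in fixed-width fields, lexicographic
# order equals chronological order.
dates = ["2016-04-14", "2015-05-29", "2015-05-28"]
chronological = sorted(dates)  # a plain string sort

assert "2015-05-28" < "2015-05-29" < "2016-04-14"
# A bare year prefix also compares sensibly, as in `c_date >= "2016"`:
assert "2016-04-14" >= "2016"
```

This breaks down as soon as the format is not fixed-width (e.g. `2016-4-14`), so it is a property of the formatting, not of strings in general.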

Re: Spark sql not pushing down timestamp range queries

2016-04-14 Thread Josh Rosen
AFAIK this is not being pushed down because it involves an implicit cast and we currently don't push casts into data sources or scans; see https://github.com/databricks/spark-redshift/issues/155 for a possibly-related discussion. On Thu, Apr 14, 2016 at 10:27 AM Mich Talebzadeh

Re: EMR Spark log4j and metrics

2016-04-14 Thread Peter Halliday
An update to this is that I can see the log4j.properties files and the metrics.properties files correctly on the master. When I submit a Spark Step that runs Spark in deploy mode of cluster, I see the cluster files being zipped up and pushed via hdfs to the driver and workers. However, I don't

Re: YARN vs Standalone Spark Usage in production

2016-04-14 Thread Mich Talebzadeh
Hi Alex, Do you mean using Spark with Yarn-client compared to using Spark Local? HTH Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw *

Re: YARN vs Standalone Spark Usage in production

2016-04-14 Thread Alexander Pivovarov
AWS EMR includes Spark on Yarn Hortonworks and Cloudera platforms include Spark on Yarn as well On Thu, Apr 14, 2016 at 7:29 AM, Arkadiusz Bicz wrote: > Hello, > > Is there any statistics regarding YARN vs Standalone Spark Usage in > production ? > > I would like to

Re: Spark sql not pushing down timestamp range queries

2016-04-14 Thread Mich Talebzadeh
Are you comparing strings here or timestamps? Filter ((cast(registration#37 as string) >= 2015-05-28) && (cast(registration#37 as string) <= 2015-05-29)) Dr Mich Talebzadeh

Spark/Parquet

2016-04-14 Thread Younes Naguib
Hi all, When is Parquet 2.0 planned in Spark? Or is it already supported? Younes Naguib Triton Digital | 1440 Ste-Catherine W., Suite 1200 | Montreal, QC H3G 1R8 Tel.: +1 514 448 4037 x2688 | Tel.: +1 866 448 4037 x2688 | younes.nag...@tritondigital.com

Spark sql not pushing down timestamp range queries

2016-04-14 Thread Kiran Chitturi
Hi, Timestamp range filter queries in SQL are not getting pushed down to the PrunedFilteredScan instances. The filtering is happening at the Spark layer. The physical plan for timestamp range queries is not showing the pushed filters, whereas range queries on other types are working fine as the

Re: Sqoop on Spark

2016-04-14 Thread Mich Talebzadeh
Hi, "SQOOP just extracted for me 1,253,015,160 records in 30 minutes running in 4 threads, that is 246 GB of data." Could you please give the source of the database and where it was (on the same host as Hive or another host)? thanks Dr Mich Talebzadeh

Re: [ERROR]: Spark 1.5.2 + Hbase 1.1 + Hive 1.2 + HbaseIntegration

2016-04-14 Thread Teng Qiu
Forwarding you these mails; hope they can help you. You can take a look at this post http://www.abcn.net/2014/07/lighting-spark-with-hbase-full-edition.html 2016-03-04 3:30 GMT+01:00 Divya Gehlot : > Hi Teng, > > Thanks for the link you shared , helped me figure out the

Re: Sqoop on Spark

2016-04-14 Thread Jörn Franke
They wanted to have alternatives. I recommended the original approach of simply using sqoop. > On 14 Apr 2016, at 16:09, Gourav Sengupta wrote: > > Hi, > > SQOOP just extracted for me 1,253,015,160 records in 30 minutes running in 4 > threads, that is 246 GB of

Exposing temp table via Hive Thrift server

2016-04-14 Thread ram kumar
Hi, In spark-shell (scala), we import *org.apache.spark.sql.hive.thriftserver._* for starting the Hive Thrift server programmatically for a particular hive context, as *HiveThriftServer2.startWithContext(hiveContext)*, to expose a registered temp table for that particular session. We used pyspark for

YARN vs Standalone Spark Usage in production

2016-04-14 Thread Arkadiusz Bicz
Hello, Are there any statistics regarding YARN vs Standalone Spark usage in production? I would like to choose the most supported and used technology in production for our project. BR, Arkadiusz Bicz

RE: Strange bug: Filter problem with parenthesis

2016-04-14 Thread Saif.A.Ellafi
Appreciated Michael, but this doesn’t help my case; the filter string is being submitted from outside my program. Is there any other alternative? Some literal string parser or anything I can do beforehand? Saif From: Michael Armbrust [mailto:mich...@databricks.com] Sent: Wednesday, April 13, 2016

Re: Sqoop on Spark

2016-04-14 Thread Gourav Sengupta
Hi, SQOOP just extracted for me 1,253,015,160 records in 30 minutes running in 4 threads, that is 246 GB of data. Why is the discussion about using anything other than SQOOP still so wonderfully on? Regards, Gourav On Mon, Apr 11, 2016 at 6:26 PM, Jörn Franke wrote: >

Re: Spark Yarn closing sparkContext

2016-04-14 Thread Ted Yu
Can you pastebin the failure message ? Did you happen to take jstack during the close ? Which Hadoop version do you use ? Thanks > On Apr 14, 2016, at 5:53 AM, nihed mbarek wrote: > > Hi, > I have an issue with closing my application context, the process take a long >

Spark streaming application doesn't generate jobs after running a week; at last, it throws an OOM exception

2016-04-14 Thread yuemeng (A)
@All There is a strange problem. I had been running a spark streaming application for a long time. Here is the application info: 1) Fetch data from kafka using the direct api 2) Use sql to write each rdd of the Dstream into redis 3) Read data from redis Everything seems ok during

Spark Yarn closing sparkContext

2016-04-14 Thread nihed mbarek
Hi, I have an issue with closing my application context; the process takes a long time and fails at the end. On the other hand, my result was generated in the output folder and the _SUCCESS file was created. I'm using spark 1.6 with yarn. Any idea? regards, -- MBAREK Med Nihed, Fedora Ambassador,

Spark MLib LDA Example

2016-04-14 Thread Amit Singh Hora
Hi All, I am very new to Spark MLlib. I am trying to understand and implement Spark MLlib's LDA algorithm. The goal is to get the topics present in the given documents and the terms within those topics. I followed the link below https://gist.github.com/jkbradley/ab8ae22a8282b2c8ce33

New syntax error in Spark with for loop

2016-04-14 Thread raghunathr85
I was using Spark 1.2.x earlier and my PySpark worked well with that version. When I upgraded to Spark 1.5.0 I am getting a SyntaxError with the same code that worked earlier. Don't know where the issue is. .map(lambda (k,v): (k,list(set(v.map(lambda (k,v): (k,{v2:i for i, v2 in
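One common cause of exactly this SyntaxError is the Python interpreter rather than Spark itself: tuple parameters in lambdas (`lambda (k, v): …`) were removed in Python 3 (PEP 3113), so the same code breaks if the upgrade also switched interpreters. That is an assumption about the environment, but the rewrite is mechanical; the toy data below stands in for the RDD:

```python
# Python 2 form (a SyntaxError on Python 3):
#   .map(lambda (k, v): (k, list(set(v))))
# Python 3 form: take one argument and unpack inside the body.
dedup = lambda kv: (kv[0], sorted(set(kv[1])))

pairs = [("a", [1, 1, 2]), ("b", [3, 3])]
result = [dedup(p) for p in pairs]   # rdd.map(dedup) in Spark
```

The same rewrite applies to every `lambda (k, v)` in the quoted snippet, including the nested dict comprehension.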

Re: Memory needs when using expensive operations like groupBy

2016-04-14 Thread Takeshi Yamamuro
Hi, You should not use these JVM options directly; use `spark.executor.memory` and `spark.driver.memory` instead. // maropu On Thu, Apr 14, 2016 at 11:32 AM, Divya Gehlot wrote: > Hi, > I am using Spark 1.5.2 with Scala 2.10 and my Spark job
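The equivalent settings as a config fragment (the sizes are placeholders, not recommendations):

```
spark.executor.memory  4g
spark.driver.memory    2g
```

or on the command line: `spark-submit --executor-memory 4g --driver-memory 2g ...`. Note that `spark.driver.memory` must be set before the driver JVM starts, so it belongs in `spark-defaults.conf` or on spark-submit, not in the application code.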

executor running time vs getting result from jupyter notebook

2016-04-14 Thread Patcharee Thongtra
Hi, I am running a Jupyter notebook with pyspark. I noticed from the history server UI that some tasks spend a lot of time on either executor running time or getting result, but some tasks finish both steps very quickly. All tasks however have very similar input size. What can be the

Re: Spark 1.6.0 - token renew failure

2016-04-14 Thread Marcelo Vanzin
You can set "spark.yarn.security.tokens.hive.enabled=false" in your config, although your app won't work if you actually need Hive delegation tokens. On Thu, Apr 14, 2016 at 12:21 AM, Luca Rea wrote: > Hi Jeff, > > > > Thank you for your support, I’ve removed
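As a config fragment, the workaround Marcelo describes (again, only safe if the app does not need Hive delegation tokens):

```
spark.yarn.security.tokens.hive.enabled  false
```

or equivalently `spark-submit --conf spark.yarn.security.tokens.hive.enabled=false ...`.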

Spark streaming time displayed is not current system time but it is processing current messages

2016-04-14 Thread Hemalatha A
Hi, I am facing a problem in Spark streaming. The time displayed in the Spark streaming console is 4 days prior, i.e., April 10th, which is not the current system time of the cluster, but the job is processing current messages that are pushed right now, April 14th. Can anyone please advise what time does