Spark 2.0 on HDP

2016-10-27 Thread Deenar Toraskar
Hi, has anyone tried running Spark 2.0 on HDP? I have managed to get around the issues with the timeline service (by turning it off), but am now stuck because YARN cannot find org.apache.spark.deploy.yarn.ExecutorLauncher. Error: Could not find or load main class

Re: Spark SQL partitioned tables - check for partition

2016-02-25 Thread Deenar Toraskar
> On Thu, Feb 25, 2016 at 9:24 AM, Deenar Toraskar < > deenar.toras...@gmail.com> wrote: > >> Hi >> >> How does one check for the presence of a partition in a Spark SQL >> partitioned table (saved using dataframe.write.partitionBy("partCol"), not

Spark SQL partitioned tables - check for partition

2016-02-25 Thread Deenar Toraskar
Hi How does one check for the presence of a partition in a Spark SQL partitioned table (saved using dataframe.write.partitionBy("partCol"), not Hive-compatible tables), other than physically checking the directory on HDFS or doing a count(*) with the partition cols in the where clause?

Re: How to control the number of files for dynamic partition in Spark SQL?

2016-01-30 Thread Deenar Toraskar
The following should work as long as your tables are created using Spark SQL: event_wk.repartition(2).write.partitionBy("eventDate").format("parquet").insertInto("event") If you want to stick to using "insert overwrite" for Hive compatibility, then you can repartition twice, instead of setting
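A minimal runnable restatement of the one-liner above (DataFrame name, path and table are illustrative; the thread writes into an existing Spark-SQL-created table via insertInto, while this sketch writes to a path):

    // Sketch: control the number of files per partition directory by repartitioning
    // before a partitioned write (Spark 1.5/1.6 DataFrame API).
    eventWk
      .repartition(2)                   // at most 2 output files per eventDate directory
      .write
      .partitionBy("eventDate")
      .parquet("/data/events")          // or .insertInto("event") for an existing Spark-SQL-created table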

Re: Spark 2.0.0 release plan

2016-01-29 Thread Deenar Toraskar
A related question: are there plans to move the default Spark builds to Scala 2.11 with Spark 2.0? Regards Deenar On 27 January 2016 at 19:55, Michael Armbrust wrote: > We do maintenance releases on demand when there is enough to justify doing > one. I'm hoping to cut

spark-xml data source (com.databricks.spark.xml) not working with spark 1.6

2016-01-28 Thread Deenar Toraskar
Hi Has anyone tried using spark-xml with Spark 1.6? I cannot even get the sample books.xml file (wget https://github.com/databricks/spark-xml/raw/master/src/test/resources/books.xml ) working https://github.com/databricks/spark-xml scala> val df =
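For reference, a hedged sketch of the usage presumably being attempted, assuming the spark-xml package is on the classpath (e.g. started with --packages com.databricks:spark-xml_2.10:0.3.3; the version is illustrative):

    // Read books.xml, treating each <book> element as a row.
    val df = sqlContext.read
      .format("com.databricks.spark.xml")
      .option("rowTag", "book")
      .load("books.xml")
    df.printSchema()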

Re: how to run latest version of spark in old version of spark in cloudera cluster ?

2016-01-27 Thread Deenar Toraskar
Sri Look at the instructions here. They are for 1.5.1, but should also work for 1.6 https://www.linkedin.com/pulse/running-spark-151-cdh-deenar-toraskar-cfa?trk=hp-feed-article-title-publish=true=true Deenar On 27 January 2016 at 20:16, Koert Kuipers <ko...@tresata.com> wrote: >

Re: hivethriftserver2 problems on upgrade to 1.6.0

2016-01-27 Thread Deenar Toraskar
James The problem you are facing is due to a feature introduced in Spark 1.6 - multi-session mode. If you want to see temporary tables across sessions, *set spark.sql.hive.thriftServer.singleSession=true*. From Spark 1.6, by default the Thrift server runs in multi-session mode. Which
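For reference, one way to apply the setting mentioned above when launching the Thrift server (an assumption about deployment, not quoted from the thread):

    sbin/start-thriftserver.sh --conf spark.sql.hive.thriftServer.singleSession=true

or add spark.sql.hive.thriftServer.singleSession=true to spark-defaults.conf.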

Sharing HiveContext in Spark JobServer / getOrCreate

2016-01-25 Thread Deenar Toraskar
Hi I am using a shared sparkContext for all of my Spark jobs. Some of the jobs use HiveContext, but there isn't a getOrCreate method on HiveContext which will allow reuse of an existing HiveContext. Such a method exists on SQLContext only (def getOrCreate(sparkContext: SparkContext): SQLContext).
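A minimal sketch of a workaround (not from the thread): keep a single lazily created HiveContext per JVM behind an object, mimicking SQLContext.getOrCreate:

    import org.apache.spark.SparkContext
    import org.apache.spark.sql.hive.HiveContext

    object SharedHiveContext {
      private var instance: HiveContext = _
      // Returns the existing HiveContext, creating it on first use.
      def getOrCreate(sc: SparkContext): HiveContext = synchronized {
        if (instance == null) instance = new HiveContext(sc)
        instance
      }
    }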

Re: Sharing HiveContext in Spark JobServer / getOrCreate

2016-01-25 Thread Deenar Toraskar
On 25 January 2016 at 21:09, Deenar Toraskar < deenar.toras...@thinkreactive.co.uk> wrote: > No I hadn't. This is useful, but in some cases we do want to share the > same temporary tables between jobs so really wanted a getOrCreate > equivalent on HiveContext. > > Deenar

Generic Dataset Aggregator

2016-01-25 Thread Deenar Toraskar
Hi All https://docs.cloud.databricks.com/docs/spark/1.6/index.html#examples/Dataset%20Aggregator.html I have been converting my UDAFs to Dataset (Datasets are cool BTW) Aggregators. I have an ArraySum aggregator that does an element-wise sum of arrays. I have got the simple version working, but
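A sketch of such an aggregator against the Spark 1.6 Aggregator API (illustrative, not the author's final code), assuming all input arrays have equal length:

    import org.apache.spark.sql.expressions.Aggregator

    object ArraySum extends Aggregator[Seq[Double], Seq[Double], Seq[Double]] {
      def zero: Seq[Double] = Seq.empty
      def reduce(buffer: Seq[Double], row: Seq[Double]): Seq[Double] = merge(buffer, row)
      def merge(b1: Seq[Double], b2: Seq[Double]): Seq[Double] =
        if (b1.isEmpty) b2
        else if (b2.isEmpty) b1
        else b1.zip(b2).map { case (x, y) => x + y }   // element-wise sum
      def finish(reduction: Seq[Double]): Seq[Double] = reduction
    }

Using it as ds.select(ArraySum.toColumn) additionally needs implicit Encoders for Seq[Double] in scope.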

Re: Concatenating tables

2016-01-23 Thread Deenar Toraskar
On 23 Jan 2016 9:18 p.m., "Deenar Toraskar" < deenar.toras...@thinkreactive.co.uk> wrote: > df.unionAll(df2).unionAll(df3) > On 23 Jan 2016 9:02 p.m., "Andrew Holway" <andrew.hol...@otternetworks.de> > wrote: > >> Is there a data frame op

Error connecting to temporary derby metastore used by Spark, when running multiple jobs on the same SparkContext

2016-01-13 Thread Deenar Toraskar
Hi I am using the spark-jobserver and see the following messages when a lot of jobs are submitted simultaneously to the same SparkContext. Any ideas as to what might cause this. [2016-01-13 13:12:11,753] ERROR com.jolbox.bonecp.BoneCP [] [akka://JobServer/user/context-supervisor/ingest-context]

distributeBy using advantage of HDFS or RDD partitioning

2016-01-13 Thread Deenar Toraskar
Hi I have data in HDFS partitioned by a logical key and would like to preserve the partitioning when creating a dataframe for the same. Is it possible to create a dataframe that preserves partitioning from HDFS or the underlying RDD? Regards Deenar

Re: Spark SQL UDF with Struct input parameters

2016-01-13 Thread Deenar Toraskar
I have raised a JIRA to cover this https://issues.apache.org/jira/browse/SPARK-12809 On 13 January 2016 at 16:05, Deenar Toraskar < deenar.toras...@thinkreactive.co.uk> wrote: > Frank > > Sorry got my wires crossed, I had come across another issue. Now I > remember this

Re: 101 question on external metastore

2016-01-07 Thread Deenar Toraskar
you think it's from > different versions of Derby? I was playing with this as a fun experiment > and my setup was on a clean machine -- no other versions of > hive/hadoop/etc... > > On Sun, Dec 20, 2015 at 12:17 AM, Deenar Toraskar < > deenar.toras...@gmail.com> wrote: > &

Re: [Spark on YARN] Multiple Auxiliary Shuffle Service Versions

2016-01-06 Thread Deenar Toraskar
Hi guys 1. >> Add this jar to the classpath of all NodeManagers in your cluster. A related question on configuration of the auxiliary shuffle service. *How do I find the classpath for NodeManager?* I tried finding all places where the existing mapreduce shuffle jars are present and place

Re: Spark SQL dataframes explode /lateral view help

2016-01-05 Thread Deenar Toraskar
uped .groupBy("item1") I found another example above, but I can't seem to figure out what this does? val expanded = dataframe .explode[::[Long], Long]("items", "item1")(row => row) .explode[::[Long], Long]("items", "item2")(row => row) O

Spark SQL dataframes explode /lateral view help

2016-01-05 Thread Deenar Toraskar
Hi All I have the following Spark SQL query and would like to convert this to use the DataFrames API (Spark 1.6). The eee, eep and pfep are all maps of (int -> float) select e.counterparty, epe, mpfe, eepe, noOfMonthseep, teee as effectiveExpectedExposure, teep as expectedExposure , tpfep
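One possible DataFrame-API equivalent for exploding one of those map columns (column names are taken from the query above; this is an assumption, not code from the thread):

    import org.apache.spark.sql.functions.{col, explode}

    // Exploding a map<int,float> column yields one row per entry, with columns "key" and "value".
    val exploded = df.select(col("counterparty"), explode(col("eee")))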

Re: Spark SQL UDF with Struct input parameters

2015-12-25 Thread Deenar Toraskar
lus$plus$eq(ArrayBuffer.scala:47) at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) at scala.collection.AbstractIterator.to(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) On 26 December 2015 at 02:42, Deenar Toraskar &

Spark SQL UDF with Struct input parameters

2015-12-25 Thread Deenar Toraskar
Hi I am trying to define a UDF that can take an array of tuples as input def effectiveExpectedExposure(expectedExposures: Seq[(Seq[Float], Seq[Float])])= expectedExposures.map(x=> x._1 * x._2).sum/expectedExposures.map(x=> x._1).sum sqlContext.udf.register("expectedPositiveExposure",
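A hedged sketch of the usual workaround (an assumption, not the thread's resolution): when a UDF argument is an array of structs, Spark passes each struct to a Scala UDF as a Row, so the function can accept Seq[Row] and unpack fields by position:

    import org.apache.spark.sql.Row

    // Illustrative only: assumes each struct holds two float fields.
    sqlContext.udf.register("expectedPositiveExposure", (exposures: Seq[Row]) => {
      val weighted = exposures.map(r => r.getFloat(0) * r.getFloat(1)).sum
      val weights  = exposures.map(_.getFloat(0)).sum
      weighted / weights
    })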

Re: 101 question on external metastore

2015-12-19 Thread Deenar Toraskar
Apparently it is down to different versions of Derby in the classpath, but I am unsure where the other version is coming from. The setup worked perfectly with Spark 1.3.1. Deenar On 20 December 2015 at 04:41, Deenar Toraskar <deenar.toras...@gmail.com> wrote: > Hi Yana/All >

Re: 101 question on external metastore

2015-12-19 Thread Deenar Toraskar
Hi Yana/All I am getting the same exception. Did you make any progress? Deenar On 5 November 2015 at 17:32, Yana Kadiyska wrote: > Hi folks, trying experiment with a minimal external metastore. > > I am following the instructions here: >

Re: [Spark 1.5]: Exception in thread "broadcast-hash-join-2" java.lang.OutOfMemoryError: Java heap space -- Work in 1.4, but 1.5 doesn't

2015-12-15 Thread Deenar Toraskar
On 16 December 2015 at 06:19, Deenar Toraskar < deenar.toras...@thinkreactive.co.uk> wrote: > Hi > > I had the same problem. There is a query with a lot of small tables (5x) > all below the broadcast threshold and Spark is broadcasting all these > tables toge

Re: [Spark 1.5]: Exception in thread "broadcast-hash-join-2" java.lang.OutOfMemoryError: Java heap space -- Work in 1.4, but 1.5 doesn't

2015-12-15 Thread Deenar Toraskar
Hi I have created an issue for this https://issues.apache.org/jira/browse/SPARK-12358 Regards Deenar On 16 December 2015 at 06:21, Deenar Toraskar <deenar.toras...@gmail.com> wrote: > > > On 16 December 2015 at 06:19, Deenar Toraskar < > deenar.toras...@thinkreactive.

Re: hive thriftserver and fair scheduling

2015-12-08 Thread Deenar Toraskar
Thanks Michael, I'll try it out. Another quick/important question: How do I make udfs available to all of the hive thriftserver users? Right now, when I launch a spark-sql client, I notice that it reads the ~/.hiverc file and all udfs get picked up but this doesn't seem to be working in hive

Re: Can't create UDF's in spark 1.5 while running using the hive thrift service

2015-12-08 Thread Deenar Toraskar
Hi Trystan I am facing the same issue. It only appears with the thrift server, the same call works fine via the spark-sql shell. Do you have any workarounds and have you filed a JIRA/bug for the same? Regards Deenar On 12 October 2015 at 18:01, Trystan Leftwich wrote: >

Re: Dataset and lambas

2015-12-07 Thread Deenar Toraskar
/docs/latest/mllib-data-types.html Regards Deenar On 7 December 2015 at 20:21, Michael Armbrust <mich...@databricks.com> wrote: > On Sat, Dec 5, 2015 at 3:27 PM, Deenar Toraskar <deenar.toras...@gmail.com > > wrote: >> >> On a similar note, what

Re: SparkSQL AVRO

2015-12-07 Thread Deenar Toraskar
By default Spark will create one file per partition. Spark SQL defaults to using 200 partitions. If you want to reduce the number of files written out, repartition your dataframe using repartition and give it the desired number of partitions.
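For example, a brief sketch (the format string assumes the databricks spark-avro package; names and paths are illustrative):

    df.repartition(10)                       // 10 output files instead of the default 200
      .write
      .format("com.databricks.spark.avro")
      .save("/output/path")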

Spark SQL - saving to multiple partitions in parallel - FileNotFoundException on _temporary directory possible bug?

2015-12-07 Thread Deenar Toraskar
Hi I have a process that writes to multiple partitions of the same table in parallel using multiple threads sharing the same SQL context df.write.partitionedBy("partCol").insertInto("tableName") . I am getting FileNotFoundException on _temporary directory. Each write only goes to a single

Re: Dataset and lambas

2015-12-05 Thread Deenar Toraskar
Hi Michael On a similar note, what is involved in getting native support for some user defined functions, so that they are as efficient as native Spark SQL expressions? I had one particular one - an arraySum (element wise sum) that is heavily used in a lot of risk analytics. Deenar On 5

Spark Streaming BackPressure and Custom Receivers

2015-12-03 Thread Deenar Toraskar
Hi I was going through the Spark Streaming BackPressure feature documentation and wanted to understand how I can ensure my custom receiver is able to handle rate limiting. I have a custom receiver similar to the TwitterInputDStream, but there is no obvious way to throttle what is being read from

Re: Spark-SQL idiomatic way of adding a new partition or writing to Partitioned Persistent Table

2015-11-22 Thread Deenar Toraskar
Thanks Michael Thanks for the response. Here is my understanding, correct me if I am wrong 1) Spark SQL written partitioned tables do not write metadata to the Hive metastore. Spark SQL discovers partitions from the table location on the underlying DFS, and not the metastore. It does this the

Spark-SQL idiomatic way of adding a new partition or writing to Partitioned Persistent Table

2015-11-21 Thread Deenar Toraskar
Hi guys Is it possible to add a new partition to a persistent table using Spark SQL? The following call works and data gets written in the correct directories, but the partition metadata is not added to the Hive metastore. In addition I see nothing preventing any arbitrary schema being appended

Unable to load native-hadoop library for your platform - already loaded in another classloader

2015-11-18 Thread Deenar Toraskar
Hi I want to make sure we use short-circuit local reads for performance. I have set the LD_LIBRARY_PATH correctly, confirmed that the native libraries match our platform (i.e. are 64 bit and are loaded successfully). When I start Spark, I get the following message after increasing the logging

Avro RDD to DataFrame

2015-11-16 Thread Deenar Toraskar
Hi The spark-avro module supports creation of a DataFrame from Avro files. How can I convert an RDD of Avro objects that I get via Spark Streaming into a DataFrame? val avroStream = KafkaUtils.createDirectStream[AvroKey[GenericRecord], NullWritable, AvroKeyInputFormat[GenericRecord]](..)

spark sql "create temporary function" scala functions

2015-11-15 Thread Deenar Toraskar
Hi I wanted to know how to go about registering Scala functions as UDFs using the Spark SQL create temporary function statement. Currently I do the following /* convert prices to holding period returns */ object VaR extends Serializable { def returns(prices :Seq[Double], horizon: Integer) :

Re: Re: Spark RDD cache persistence

2015-11-05 Thread Deenar Toraskar
You can have a long running Spark context in several fashions. This will ensure your data will be cached in memory. Clients will access the RDD through a REST API that you can expose. See the Spark Job Server, it does something similar. It has something called Named RDDs Using Named RDDs Named

Re: How to lookup by a key in an RDD

2015-11-02 Thread Deenar Toraskar
Swetha Currently IndexedRDD is an external library and not part of Spark Core. You can use it by adding a dependency and pull it in. There are plans to move it to Spark core tracked in https://issues.apache.org/jira/browse/SPARK-2365. See

Re: execute native system commands in Spark

2015-11-02 Thread Deenar Toraskar
You can do the following; make sure the number of executors requested equals the number of executors on your cluster. import scala.sys.process._ import org.apache.hadoop.security.UserGroupInformation import org.apache.spark.deploy.SparkHadoopUtil sc.parallelize(0 to 10).map { _
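A simplified sketch of the idea (not the full snippet from the thread): run a shell command once per partition, with roughly one partition per executor:

    import scala.sys.process._

    val numExecutors = 10   // illustrative; match your cluster
    val outputs = sc.parallelize(0 until numExecutors, numExecutors)
      .mapPartitions(_ => Iterator(Seq("hostname").!!.trim))   // runs on the executors
      .collect()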

Re: Spark 1.5 on CDH 5.4.0

2015-11-01 Thread Deenar Toraskar
HI guys I have documented the steps involved in getting Spark 1.5.1 run on CDH 5.4.0 here, let me know if it works for you as well https://www.linkedin.com/pulse/running-spark-151-cdh-deenar-toraskar-cfa?trk=hp-feed-article-title-publish Looking forward to CDH 5.5 which supports Spark 1.5.x out

Re: Pulling data from a secured SQL database

2015-10-31 Thread Deenar Toraskar
Thomas I have the same problem, though in my case getting Kerberos authentication to MSSQLServer from the cluster nodes does not seem to be supported. There are a couple of options that come to mind. 1) You can pull the data running sqoop in local mode on the smaller development machines and

Re: No way to supply hive-site.xml in yarn client mode?

2015-10-29 Thread Deenar Toraskar
context.py", > line 660, in _ssql_ctx > "build/sbt assembly", e) > Exception: ("You must build Spark with Hive. Export 'SPARK_HIVE=true' and > run build/sbt assembly", Py4JJavaError(u'An error occurred while calling > None.org.apache.spark.sql.hive.Hiv

Re: nested select is not working in spark sql

2015-10-29 Thread Deenar Toraskar
You can try the following syntax https://cwiki.apache.org/confluence/display/Hive/LanguageManual+SubQueries SELECT * FROM A WHERE A.a IN (SELECT foo FROM B); Regards Deenar *Think Reactive Ltd* deenar.toras...@thinkreactive.co.uk 07714140812 On 28 October 2015 at 14:37, Richard Hillegas

Re: Spark -- Writing to Partitioned Persistent Table

2015-10-29 Thread Deenar Toraskar
Hi Bryan For your use case you don't need to have multiple metastores. The default metastore uses embedded Derby. This cannot be shared amongst multiple

Re: No way to supply hive-site.xml in yarn client mode?

2015-10-29 Thread Deenar Toraskar
er -DskipTests clean package > > On Thu, Oct 29, 2015 at 10:16 AM, Deenar Toraskar < > deenar.toras...@gmail.com> wrote: > >> Are you using Spark built with hive ? >> >> # Apache Hadoop 2.4.X with Hive 13 support >> mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.

Re: No way to supply hive-site.xml in yarn client mode?

2015-10-29 Thread Deenar Toraskar
tocolTranslatorPB.mkdirs(ClientNamenodeProtocolTranslatorPB.java:531) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.ja

Re: No way to supply hive-site.xml in yarn client mode?

2015-10-29 Thread Deenar Toraskar
u, Oct 29, 2015 at 11:14 AM, Deenar Toraskar < > deenar.toras...@gmail.com> wrote: > >> Here is what I did, maybe that will help you. >> >> 1) Downloaded spark-1.5.1 (With HAdoop 2.6.0) spark-1.5.1-bin-hadoop2.6 >> and extracted it on the edge node, set SPAR

Re: Packaging a jar for a jdbc connection using sbt assembly and scala.

2015-10-29 Thread Deenar Toraskar
Hi Dean I guess you are using Spark 1.3. - The JDBC driver class must be visible to the primordial class loader on the client session and on all executors. This is because Java’s DriverManager class does a security check that results in it ignoring all drivers not visible to the

Re: No way to supply hive-site.xml in yarn client mode?

2015-10-29 Thread Deenar Toraskar
*Hi Zoltan* Add hive-site.xml to your YARN_CONF_DIR. i.e. $SPARK_HOME/conf/yarn-conf Deenar *Think Reactive Ltd* deenar.toras...@thinkreactive.co.uk 07714140812 On 28 October 2015 at 14:28, Zoltan Fedor wrote: > Hi, > We have a shared CDH 5.3.3 cluster and trying to

Re: Dynamic Resource Allocation with Spark Streaming (Standalone Cluster, Spark 1.5.1)

2015-10-27 Thread Deenar Toraskar
Until Spark Streaming supports dynamic allocation, you could use a StreamingListener to monitor batch execution times and, based on them, call sparkContext.requestExecutors() and sparkContext.killExecutors() to add and remove executors explicitly. On 26 October 2015 at 21:37, Ted Yu
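A rough sketch of that approach (thresholds and the scaling policy are made up for illustration; requestExecutors/killExecutors are developer APIs that need a cluster manager supporting them):

    import org.apache.spark.SparkContext
    import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

    class ScalingListener(sc: SparkContext, batchIntervalMs: Long) extends StreamingListener {
      override def onBatchCompleted(batch: StreamingListenerBatchCompleted): Unit = {
        val processingMs = batch.batchInfo.processingDelay.getOrElse(0L)
        if (processingMs > batchIntervalMs) {
          sc.requestExecutors(1)   // falling behind: ask for one more executor
        }
        // a real policy would also track spare capacity and call sc.killExecutors(...) to scale down
      }
    }

    // ssc.addStreamingListener(new ScalingListener(ssc.sparkContext, batchIntervalMs = 10000))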

Re: Broadcast table

2015-10-27 Thread Deenar Toraskar
1) if you are using thrift server any cached tables would be cached for all sessions (I am not sure if this was your question) 2) If you want to ensure that the smaller table in the join is replicated to all nodes, you can do the following left.join(broadcast(right), "joinKey") look at this

Re: get directory names that are affected by sc.textFile("path/to/dir/*/*/*.js")

2015-10-27 Thread Deenar Toraskar
This won't work as you can never guarantee which files were read by Spark if some other process is writing files to the same location. It would be far less work to move files matching your pattern to a staging location and then load them using sc.textFile. you should find hdfs file system calls

Re: get host from rdd map

2015-10-26 Thread Deenar Toraskar
You can call any API that returns the hostname in your map function. Here's a simplified example. You would generally use mapPartitions as it will save the overhead of retrieving the hostname multiple times: import scala.sys.process._ val distinctHosts =
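A simplified sketch of the mapPartitions variant (using java.net.InetAddress rather than shelling out; purely illustrative):

    // Resolve the hostname once per partition rather than once per record.
    val distinctHosts = sc.parallelize(1 to 1000, 20)
      .mapPartitions(_ => Iterator(java.net.InetAddress.getLocalHost.getHostName))
      .distinct()
      .collect()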

Re: Spark scala REPL - Unable to create sqlContext

2015-10-26 Thread Deenar Toraskar
Embedded Derby, which Hive/Spark SQL uses as the default metastore only supports a single user at a time. Till this issue is fixed, you could use another metastore that supports multiple concurrent users (e.g. networked derby or mysql) to get around it. On 25 October 2015 at 16:15, Ge, Yao (Y.)

Re: Spark 1.5 on CDH 5.4.0

2015-10-23 Thread Deenar Toraskar
o build your > own using the -Pyarn flag. > > -Sandy > > On Thu, Oct 22, 2015 at 9:04 AM, Deenar Toraskar < > deenar.toras...@gmail.com> wrote: > >> Hi I have got the prebuilt version of Spark 1.5 for Hadoop 2.6 ( >> http://www.apache.org/dyn/closer.lua/spark

Re: Best way to use Spark UDFs via Hive (Spark Thrift Server)

2015-10-23 Thread Deenar Toraskar
You can do the following. Start the spark-shell. Register the UDFs in the shell using sqlContext, then start the Thrift Server using startWithContext from the spark shell:
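A sketch of those steps, assuming a spark-shell started from a Hive-enabled build (so sqlContext is a HiveContext) with the thrift-server classes in the assembly; the UDF is illustrative:

    import org.apache.spark.sql.hive.HiveContext
    import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

    sqlContext.udf.register("arraySum", (xs: Seq[Double]) => xs.sum)
    HiveThriftServer2.startWithContext(sqlContext.asInstanceOf[HiveContext])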

Re: Spark 1.5 on CDH 5.4.0

2015-10-23 Thread Deenar Toraskar
ot; prefix spark.yarn.jar=/opt/spark-1.5.1-bin/... Regards Deenar On 23 October 2015 at 17:30, Deenar Toraskar < deenar.toras...@thinkreactive.co.uk> wrote: > I got this working. For others trying this It turns out in Spark 1.3/CDH5.4 > > spark.yarn.jar=local:/opt/cloudera/parce

Re: Maven Repository Hosting for Spark SQL 1.5.1

2015-10-22 Thread Deenar Toraskar
I can see this artifact in public repos http://mvnrepository.com/artifact/org.apache.spark/spark-sql_2.10/1.5.1 http://central.maven.org/maven2/org/apache/spark/spark-sql_2.10/1.5.1/spark-sql_2.10-1.5.1.jar check your proxy settings or the list of repos you are using. Deenar On 22 October 2015

Accessing external Kerberised resources from Spark executors in Yarn client/cluster mode

2015-10-22 Thread Deenar Toraskar
Hi All I am trying to access a SQLServer that uses Kerberos for authentication from Spark. I can successfully connect to the SQLServer from the driver node, but any connections to SQLServer from executors fails with "Failed to find any Kerberos tgt".

Spark 1.5 on CDH 5.4.0

2015-10-22 Thread Deenar Toraskar
Hi I have got the prebuilt version of Spark 1.5 for Hadoop 2.6 ( http://www.apache.org/dyn/closer.lua/spark/spark-1.5.1/spark-1.5.1-bin-hadoop2.6.tgz) working with CDH 5.4.0 in local mode on a cluster with Kerberos. It works well including connecting to the Hive metastore. I am facing an issue

Re: Spark SQL Exception: Conf non-local session path expected to be non-null

2015-10-20 Thread Deenar Toraskar
This seems to be set using hive.exec.scratchdir, is that set? hdfsSessionPath = new Path(hdfsScratchDirURIString, sessionId); createPath(conf, hdfsSessionPath, scratchDirPermission, false, true); conf.set(HDFS_SESSION_PATH_KEY, hdfsSessionPath.toUri().toString()); On 20 October 2015

Re: can I use Spark as alternative for gem fire cache ?

2015-10-20 Thread Deenar Toraskar
Kali >> can I cache a RDD in memory for a whole day ? as of I know RDD will get empty once the spark code finish executing (correct me if I am wrong). Spark can definitely be used as a replacement for in-memory databases for certain use cases. Spark RDDs are not shared amongst contexts. You

Re: Ahhhh... Spark creates >30000 partitions... What can I do?

2015-10-20 Thread Deenar Toraskar
also check out wholeTextFiles https://spark.apache.org/docs/1.4.0/api/java/org/apache/spark/SparkContext.html#wholeTextFiles(java.lang.String,%20int) On 20 October 2015 at 15:04, Lan Jiang wrote: > As Francois pointed out, you are encountering a classic small file >

Re: JdbcRDD Constructor

2015-10-20 Thread Deenar Toraskar
is going to implement if do not give any such inputs as > "lowerbound" and "upperbound" to JDBCRDD Constructor or DataSourceAPI > > Thanks in advance for your inputs > > Regards, > Satish Chandra J > > > On Thu, Sep 24, 2015 at 10:18 PM, Deenar Toraskar

Re: Spark SQL Thriftserver and Hive UDF in Production

2015-10-19 Thread Deenar Toraskar
Reece You can do the following. Start the spark-shell. Register the UDFs in the shell using sqlContext, then start the Thrift Server using startWithContext from the spark shell: https://github.com/apache/spark/blob/master/sql/hive-

Re: How to have Single refernce of a class in Spark Streaming?

2015-10-17 Thread Deenar Toraskar
Swetha Look at http://spark.apache.org/docs/latest/programming-guide.html#shared-variables Normally, when a function passed to a Spark operation (such as map or reduce) is executed on a remote cluster node, it works

Re: Spark SQL running totals

2015-10-15 Thread Deenar Toraskar
You can do a self-join of the table with itself with the join clause being a.col1 >= b.col1: select a.col1, a.col2, sum(b.col2) from tablea as a left outer join tablea as b on (a.col1 >= b.col1) group by a.col1, a.col2 I haven't tried it, but can't see why it can't work, but doing it in an RDD might be

Re: Spark DataFrame GroupBy into List

2015-10-14 Thread Deenar Toraskar
collect_set and collect_list are built-in User Defined functions see https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF On 14 October 2015 at 03:45, SLiZn Liu wrote: > Hi Michael, > > Can you be more specific on `collect_set`? Is it a built-in function

Re: OutOfMemoryError When Reading Many json Files

2015-10-14 Thread Deenar Toraskar
Hi Why don't you check if you can just process the large file standalone and then do the outer loop next. sqlContext.read.json(jsonFile) .select($"some", $"fields") .withColumn( "new_col", some_transformations($"col")) .rdd.map( x: Row => (k, v) ) .combineByKey() Deenar On 14 October 2015 at

Re: Running in cluster mode causes native library linking to fail

2015-10-13 Thread Deenar Toraskar
Hi Bernardo Is the native library installed on all machines of your cluster and are you setting both the spark.driver.extraLibraryPath and spark.executor.extraLibraryPath ? Deenar On 14 October 2015 at 05:44, Bernardo Vecchia Stein < bernardovst...@gmail.com> wrote: > Hello, > > I am trying

Re: Datastore or DB for spark

2015-10-10 Thread Deenar Toraskar
The choice of datastore is driven by your use case. In fact Spark can work with multiple datastores too. Each datastore is optimised for certain kinds of data. e.g. HDFS is great for analytics and large data sets at rest. It is scalable and very performant, but is immutable. No-SQL databases

Re: HDFS small file generation problem

2015-09-27 Thread Deenar Toraskar
You could try a couple of things a) use Kafka for stream processing, store current incoming events and Spark Streaming job output in Kafka rather than on HDFS and dual write to HDFS too (in a micro-batched mode), so every x minutes. Kafka is more suited to processing lots of small events. b)

Re: JdbcRDD Constructor

2015-09-24 Thread Deenar Toraskar
On 24 September 2015 at 17:48, Deenar Toraskar < deenar.toras...@thinkreactive.co.uk> wrote: > you are interpreting the JDBCRDD API incorrectly. If you want to use > partitions, then the column used to partition and present in the where > clause must be numeric and the lower bound

Re: How to turn on basic authentication for the Spark Web

2015-09-23 Thread Deenar Toraskar
Rafal Check this out https://spark.apache.org/docs/latest/security.html Regards Deenar On 23 September 2015 at 19:13, Rafal Grzymkowski wrote: > Hi, > > I want to enable basic Http authentication for the spark web UI (without > recompilation need for Spark). > I see there is

Re: JdbcRDD Constructor

2015-09-23 Thread Deenar Toraskar
Satish Can you post the SQL query you are using? The SQL query must have 2 placeholders and both of them should be an inclusive range (<= and >=), e.g. select title, author from books where ? <= id and id <= ? Are you doing this? Deenar On 23 September 2015 at 20:18, Dee
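For reference, a sketch of a JdbcRDD built around that query (URL, bounds and partition count are illustrative):

    import java.sql.DriverManager
    import org.apache.spark.rdd.JdbcRDD

    val rdd = new JdbcRDD(
      sc,
      () => DriverManager.getConnection("jdbc:...", "user", "password"),
      "select title, author from books where ? <= id and id <= ?",   // two placeholders, inclusive range
      lowerBound = 1L,
      upperBound = 100000L,
      numPartitions = 10,
      mapRow = rs => (rs.getString("title"), rs.getString("author")))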

Re: How to turn on basic authentication for the Spark Web

2015-09-23 Thread Deenar Toraskar
Check this out http://lambda.fortytools.com/post/26977061125/servlet-filter-for-http-basic-auth or https://gist.github.com/neolitec/8953607 for examples of filters implementing basic authentication. Implement one of these and set them in the spark.ui.filters property. Deenar On 23 September 2015

Spark 1.5 UDAF ArrayType

2015-09-22 Thread Deenar Toraskar
Hi I am trying to write a UDAF ArraySum that does an element-wise sum of arrays of Doubles, returning an array of Double, following the sample in https://databricks.com/blog/2015/09/16/spark-1-5-dataframe-api-highlights-datetimestring-handling-time-intervals-and-udafs.html. I am getting the
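For context, a sketch of the intended shape of such a UDAF (illustrative only; the thread reports an error with a similar UDAF on Spark 1.5, so this is not a confirmed fix), assuming equal-length, non-null arrays:

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
    import org.apache.spark.sql.types._

    class ArraySumUDAF extends UserDefinedAggregateFunction {
      def inputSchema: StructType = StructType(StructField("value", ArrayType(DoubleType)) :: Nil)
      def bufferSchema: StructType = StructType(StructField("sum", ArrayType(DoubleType)) :: Nil)
      def dataType: DataType = ArrayType(DoubleType)
      def deterministic: Boolean = true
      def initialize(buffer: MutableAggregationBuffer): Unit = buffer(0) = Seq.empty[Double]
      def update(buffer: MutableAggregationBuffer, input: Row): Unit =
        buffer(0) = add(buffer.getAs[Seq[Double]](0), input.getAs[Seq[Double]](0))
      def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit =
        buffer1(0) = add(buffer1.getAs[Seq[Double]](0), buffer2.getAs[Seq[Double]](0))
      def evaluate(buffer: Row): Any = buffer.getAs[Seq[Double]](0)
      private def add(a: Seq[Double], b: Seq[Double]): Seq[Double] =
        if (a.isEmpty) b else if (b.isEmpty) a else a.zip(b).map { case (x, y) => x + y }
    }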

Re: How to share memory in a broadcast between tasks in the same executor?

2015-09-22 Thread Deenar Toraskar
Clement In local mode all worker threads run in the driver VM. Your dictionary should not be copied 32 times, in fact it wont be broadcast at all. Have you tried increasing spark.driver.memory to ensure that the driver uses all the memory on the machine. Deenar On 22 September 2015 at 19:42,

Re: spark-avro takes a lot time to load thousands of files

2015-09-22 Thread Deenar Toraskar
Daniel Can you elaborate on why you are using a broadcast variable to concatenate many Avro files into a single ORC file? Look at wholeTextFiles on SparkContext. SparkContext.wholeTextFiles lets you read a directory containing multiple small text files, and returns each of them as (filename,

Re: Compute Median in Spark Dataframe

2015-06-22 Thread Deenar Toraskar
poking at the internals is kind of dangerous). On Thu, Jun 4, 2015 at 6:28 AM, Deenar Toraskar deenar.toras...@gmail.com wrote: Hi Holden, Olivier So for column you need to pass in a Java function, I have some sample code which does this but it does terrible things to access Spark internals

Re: Anybody using Spark SQL JDBC server with DSE Cassandra?

2015-06-04 Thread Deenar Toraskar
Mohammed Have you tried registering your Cassandra tables in Hive/Spark SQL using the data frames API. These should be then available to query via the Spark SQL/Thrift JDBC Server. Deenar On 1 June 2015 at 19:33, Mohammed Guller moham...@glassbeam.com wrote: Nobody using Spark SQL

Re: Transactional guarantee while saving DataFrame into a DB

2015-06-04 Thread Deenar Toraskar
Hi Tariq You need to handle the transaction semantics yourself. You could for example save from the dataframe to a staging table and then write to the final table using a single atomic INSERT INTO finalTable from stagingTable call. Remember to clear the staging table first to recover from
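A sketch of that staging-table approach over JDBC (URL, properties and table names are illustrative):

    import java.sql.DriverManager

    val url = "jdbc:postgresql://host/db"
    val props = new java.util.Properties()      // add user/password as needed

    // Reload the staging table, then promote atomically with a single statement on the database side.
    df.write.mode("overwrite").jdbc(url, "staging_table", props)
    val conn = DriverManager.getConnection(url, props)
    try {
      conn.createStatement().executeUpdate("INSERT INTO final_table SELECT * FROM staging_table")
    } finally {
      conn.close()
    }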

Re: Adding an indexed column

2015-06-04 Thread Deenar Toraskar
Or you could 1) convert the dataframe to an RDD 2) use mapPartitions and zipWithIndex within each partition 3) convert the RDD back to a dataframe. You will need to make sure you preserve partitioning Deenar On 1 June 2015 at 02:23, ayan guha guha.a...@gmail.com wrote: If you are on spark 1.3, use
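A sketch of that round trip, using RDD.zipWithIndex for a global index rather than a per-partition one (the new column name is illustrative):

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types.{LongType, StructField, StructType}

    val withIdx = df.rdd.zipWithIndex().map { case (row, idx) => Row.fromSeq(row.toSeq :+ idx) }
    val indexedDf = sqlContext.createDataFrame(
      withIdx,
      StructType(df.schema.fields :+ StructField("index", LongType)))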

Re: Compute Median in Spark Dataframe

2015-06-04 Thread Deenar Toraskar
Hi Holden, Olivier So for column you need to pass in a Java function, I have some sample code which does this but it does terrible things to access Spark internals. I also need to call a Hive UDAF in a dataframe agg function. Are there any examples of what Column expects? Deenar On 2 June 2015

Re: converting DStream[String] into RDD[String] in spark streaming [I]

2015-03-29 Thread Deenar Toraskar
(saveFunc) } Regards Deenar P.S. The mail archive on nabble does not seem to show all responses. -Original Message- From: Sean Owen [mailto:so...@cloudera.com] Sent: 22 March 2015 11:49 To: Deenar Toraskar Cc: user@spark.apache.org Subject: Re: converting DStream[String] into RDD[String