Re: What factors need to be considered when upgrading to Spark 2.1.0 from Spark 1.6.0

2017-09-29 Thread Yana Kadiyska
One thing to note, if you are using Mesos, is that the version of Mesos changed from 0.21 to 1.0.0. So taking a newer Spark might push you into larger infrastructure upgrades. On Fri, Sep 22, 2017 at 2:39 PM, Gokula Krishnan D wrote: > Hello All, > > Currently our Batch ETL Jobs are in Spark 1.6.

HiveThriftserver does not seem to respect partitions

2017-09-13 Thread Yana Kadiyska
Hi folks, I have created a table in the following manner: CREATE EXTERNAL TABLE IF NOT EXISTS rum_beacon_partition ( list of columns ) COMMENT 'User Infomation' PARTITIONED BY (account_id String, product String, group_id String, year String, month String, day String) STORED AS

Trouble with Thriftserver with hsqldb (Spark 2.1.0)

2017-03-06 Thread Yana Kadiyska
Hi folks, trying to run Spark 2.1.0 thrift server against an hsqldb file and it seems to...hang. I am starting thrift server with: sbin/start-thriftserver.sh --driver-class-path ./conf/hsqldb-2.3.4.jar , completely local setup hive-site.xml is like this: hive.metastore.warehouse.d

[Thriftserver2] Controlling number of tasks

2016-08-03 Thread Yana Kadiyska
Hi folks, I have an ETL pipeline that drops a file every 1/2 hour. When Spark reads these files, I end up with 315K tasks for a dataframe reading a few days' worth of data. I know that with a regular Spark job I can use coalesce to come down to a lower number of tasks. Is there a way to tell HiveThriftserver

Re: 101 question on external metastore

2016-01-14 Thread Yana Kadiyska
e were 2 different version of derby and ensuring > the metastore and spark used the same version of Derby made the problem go > away. > > Deenar > > On 6 January 2016 at 02:55, Yana Kadiyska wrote: > >> Deenar, I have not resolved this issue. Why do you think it's fr

Re: 101 question on external metastore

2016-01-05 Thread Yana Kadiyska
> I am getting the same exception. Did you make any progress? >> >> Deenar >> >> On 5 November 2015 at 17:32, Yana Kadiyska >> wrote: >> >>> Hi folks, trying experiment with a minimal external metastore. >>> >>> I am following the in

Re: HiveServer2 Thrift OOM

2015-11-12 Thread Yana Kadiyska
et, can you confirm that? If it fall into this > category, probably you can set the > “spark.sql.thriftServer.incrementalCollect” to false; > > > > Hao > > > > *From:* Yana Kadiyska [mailto:yana.kadiy...@gmail.com] > *Sent:* Friday, November 13, 2015 8:30 AM > *To

HiveServer2 Thrift OOM

2015-11-12 Thread Yana Kadiyska
Hi folks, I'm starting a HiveServer2 from a HiveContext (HiveThriftServer2.startWithContext(hiveContext)) and then connecting to it via beeline. On the server side, I see the below error which I think is related to https://issues.apache.org/jira/browse/HIVE-6468. But I'd like to know: 1. why I

Re: Subtract on rdd2 is throwing below exception

2015-11-05 Thread Yana Kadiyska
subtract is not the issue. Spark is lazy, so a lot of the time you'd have many, many lines of code which do not in fact run until you perform some action (in your case, subtract). As you can see from the stacktrace, the NPE is from Joda, which is used in the partitioner (I'm suspecting in Cassandra). But the

101 question on external metastore

2015-11-05 Thread Yana Kadiyska
Hi folks, trying experiment with a minimal external metastore. I am following the instructions here: https://cwiki.apache.org/confluence/display/Hive/HiveDerbyServerMode I grabbed Derby 10.12.1.1 and started an instance, verified I can connect via ij tool and that process is listening on 1527 pu

Re: how to merge two dataframes

2015-10-30 Thread Yana Kadiyska
gt; +---+-+---+-+ > |customer_id| uri|browser|epoch| > +---+-+---+-+ > |999|http://foobar|firefox| 1234| > |888|http://foobar| ie|12343| > +---+-+---+-+ > > Cheers > > On Fri, Oct 30, 2015 at 12:11 PM, Ya

how to merge two dataframes

2015-10-30 Thread Yana Kadiyska
Hi folks, I have a need to "append" two dataframes -- I was hoping to use UnionAll but it seems that this operation treats the underlying dataframes as sequence of columns, rather than a map. In particular, my problem is that the columns in the two DFs are not in the same order --notice that my c
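
A minimal sketch of the usual workaround, assuming a spark-shell session with a SQLContext (the case classes and values below are illustrative, not from the thread): unionAll matches columns by position, so reorder one side to the other's column order before appending.

    import sqlContext.implicits._
    import org.apache.spark.sql.functions.col
    case class Beacon(customer_id: String, uri: String, browser: String, epoch: Long)
    case class BeaconReordered(epoch: Long, browser: String, uri: String, customer_id: String)
    val df1 = sc.parallelize(Seq(Beacon("999", "http://foobar", "firefox", 1234L))).toDF()
    val df2 = sc.parallelize(Seq(BeaconReordered(12343L, "ie", "http://foobar", "888"))).toDF()
    // align df2's columns to df1's order, then append by position
    val merged = df1.unionAll(df2.select(df1.columns.map(col): _*))
    merged.show()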

Re: Spark -- Writing to Partitioned Persistent Table

2015-10-28 Thread Yana Kadiyska
For this issue in particular ( ERROR XSDB6: Another instance of Derby may have already booted the database /spark/spark-1.4.1/metastore_db) -- I think it depends on where you start your application and HiveThriftserver from. I've run into a similar issue running a driver app first, which would crea

Re: Problem with make-distribution.sh

2015-10-26 Thread Yana Kadiyska
.10.jar >> -rw-r--r-- hbase/hadoop339666 2015-10-26 09:52 >> spark-1.6.0-SNAPSHOT-bin-custom-spark/lib/datanucleus-api-jdo-3.2.6.jar >> -rw-r--r-- hbase/hadoop 1809447 2015-10-26 09:52 >> spark-1.6.0-SNAPSHOT-bin-custom-spark/lib/datanucleus-rdbms-3.2.9.jar >> &g

Re: Maven build failed (Spark master)

2015-10-26 Thread Yana Kadiyska
In 1.4 ./make_distribution produces a .tgz file in the root directory (same directory that make_distribution is in) On Mon, Oct 26, 2015 at 8:46 AM, Kayode Odeyemi wrote: > Hi, > > The ./make_distribution task completed. However, I can't seem to locate the > .tar.gz file. > > Where does Spark

Re: Problem with make-distribution.sh

2015-10-26 Thread Yana Kadiyska
thank you so much! You are correct. This is the second time I've made this mistake :( On Mon, Oct 26, 2015 at 11:36 AM, java8964 wrote: > Maybe you need the Hive part? > > Yong > > -- > Date: Mon, 26 Oct 2015 11:34:30 -0400 > Subject: Problem with make-distribution.sh

Problem with make-distribution.sh

2015-10-26 Thread Yana Kadiyska
Hi folks, building spark instructions ( http://spark.apache.org/docs/latest/building-spark.html) suggest that ./make-distribution.sh --name custom-spark --tgz -Phadoop-2.4 -Pyarn should produce a distribution similar to the ones found on the "Downloads" page. I noticed that the tgz I built u

Re: SQLcontext changing String field to Long

2015-10-11 Thread Yana Kadiyska
h. We have kind of partitioned our data on the basis of batch_ids folder. > > How did you get around it? > > Thanks for help. :) > > On Sat, Oct 10, 2015 at 7:55 AM, Yana Kadiyska > wrote: > >> can you show the output of df.printSchema? Just a guess but I think I ran >> i

Re: spark-submit hive connection through spark Initial job has not accepted any resources

2015-10-10 Thread Yana Kadiyska
"Job has not accepted resources" is a well-known error message -- you can search the Internet. 2 common causes come to mind: 1) you already have an application connected to the master -- by default a driver will grab all resources so unless that application disconnects, nothing else is allowed to c

Re: SQLcontext changing String field to Long

2015-10-10 Thread Yana Kadiyska
can you show the output of df.printSchema? Just a guess but I think I ran into something similar with a column that was part of a path in parquet. E.g. we had an account_id in the parquet file data itself which was of type string but we also named the files in the following manner /somepath/account

Re: Help getting started with Kafka

2015-09-22 Thread Yana Kadiyska
o check to see if offsets 0 through 100 are still actually > present in the kafka logs. > > On Tue, Sep 22, 2015 at 9:38 AM, Yana Kadiyska > wrote: > >> Hi folks, I'm trying to write a simple Spark job that dumps out a Kafka >> queue into HDFS. Being very new to Kafka, not

Help getting started with Kafka

2015-09-22 Thread Yana Kadiyska
Hi folks, I'm trying to write a simple Spark job that dumps out a Kafka queue into HDFS. Being very new to Kafka, not sure if I'm messing something up on that side...My hope is to read the messages presently in the queue (or at least the first 100 for now) Here is what I have: Kafka side: ./bin/

Re: Sending yarn application logs to web socket

2015-09-07 Thread Yana Kadiyska
Hopefully someone will give you a more direct answer but whenever I'm having issues with log4j I always try -Dlog4j.debug=true. This will tell you which log4j settings are getting picked up from where. I've spent countless hours due to typos in the file, for example. On Mon, Sep 7, 2015 at 11:47 AM

Re: Problem with repartition/OOM

2015-09-06 Thread Yana Kadiyska
tions in parallel. It will run out of > memory if (number of partitions) times (Parquet block size) is greater than > the available memory. You can try to decrease the number of partitions. And > could you share the value of "parquet.block.size" and your available memory? >

Problem with repartition/OOM

2015-09-05 Thread Yana Kadiyska
Hi folks, I have a strange issue. Trying to read a 7G file and do fairly simple stuff with it: I can read the file/do simple operations on it. However, I'd prefer to increase the number of partitions in preparation for more memory-intensive operations (I'm happy to wait, I just need the job to com

Re: Failing to include multiple JDBC drivers

2015-09-05 Thread Yana Kadiyska
If memory serves me correctly in 1.3.1 at least there was a problem with when the driver was added -- the right classloader wasn't picking it up. You can try searching the archives, but the issue is similar to these threads: http://stackoverflow.com/questions/30940566/connecting-from-spark-pyspark-

Re: How to unit test HiveContext without OutOfMemoryError (using sbt)

2015-08-25 Thread Yana Kadiyska
The PermGen space error is controlled with MaxPermSize parameter. I run with this in my pom, I think copied pretty literally from Spark's own tests... I don't know what the sbt equivalent is but you should be able to pass it...possibly via SBT_OPTS? org.scalatest sca
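
As a guess at the sbt equivalent (an assumption, not something verified in this thread), forking the test JVM in build.sbt lets the options reach the process that actually runs the HiveContext tests:

    // build.sbt -- fork tests so javaOptions apply to the test JVM (sbt 0.13 syntax)
    fork in Test := true
    javaOptions in Test ++= Seq("-XX:MaxPermSize=256m", "-Xmx2g")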

[SQL/Hive] Trouble with refreshTable

2015-08-25 Thread Yana Kadiyska
I'm having trouble with refreshTable, I suspect because I'm using it incorrectly. I am doing the following: 1. Create DF from parquet path with wildcards, e.g. /foo/bar/*.parquet 2. use registerTempTable to register my dataframe 3. A new file is dropped under /foo/bar/ 4. Call hiveContext.refres

Re: spark-submit and spark-shell behaviors mismatch.

2015-07-24 Thread Yana Kadiyska
When I run spark-shell, I run with "--master > mesos://cluster-1:5050" parameter which is the same with "spark-submit". > Confused here. > > > > 2015-07-22 20:01 GMT-05:00 Yana Kadiyska : > >> Is it complaining about "collect" or "toMap&

Help with Dataframe syntax ( IN / COLLECT_SET)

2015-07-23 Thread Yana Kadiyska
Hi folks, having trouble expressing IN and COLLECT_SET on a dataframe. In other words, I'd like to figure out how to write the following query: "select collect_set(b),a from mytable where c in (1,2,3) group by a" I've started with someDF .where( -- not sure what to do for c here--- .groupBy($
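
One hedged way to write it with the DataFrame API on newer releases (isin appears in Spark 1.5 and collect_set reaches org.apache.spark.sql.functions in 1.6; on earlier versions the HiveContext SQL string above remains the straightforward route). someDF and the column names follow the question:

    import org.apache.spark.sql.functions.{col, collect_set}
    val result = someDF
      .where(col("c").isin(1, 2, 3))
      .groupBy(col("a"))
      .agg(collect_set(col("b")).as("bs"))
    result.show()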

Re: spark-submit and spark-shell behaviors mismatch.

2015-07-22 Thread Yana Kadiyska
Is it complaining about "collect" or "toMap"? In either case this error is indicative of an old version usually -- any chance you have an old installation of Spark somehow? Or scala? You can try running spark-submit with --verbose. Also, when you say it runs with spark-shell do you run spark shell

Re: Select all columns except some

2015-07-16 Thread Yana Kadiyska
Have you tried to examine what clean_cols contains -- I'm suspicious of this part: mkString(", "). Try this: val clean_cols : Seq[String] = df.columns... if you get a type error you need to work on clean_cols (I suspect yours is of type String at the moment and presents itself to Spark as a single col
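
A small sketch along those lines (the excluded names are illustrative, and df stands for the poster's DataFrame): build clean_cols as a real Seq[String] and splat it into select.

    import org.apache.spark.sql.functions.col
    val excluded = Set("col_to_drop_1", "col_to_drop_2")
    val clean_cols: Seq[String] = df.columns.filterNot(excluded.contains)
    val trimmed = df.select(clean_cols.map(col): _*)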

PairRDDFunctions and DataFrames

2015-07-16 Thread Yana Kadiyska
Hi, could someone point me to the recommended way of using countApproxDistinctByKey with DataFrames? I know I can map to pair RDD but I'm wondering if there is a simpler method? If someone knows if this operation is expressible in SQL that information would be most appreciated as well.
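
A hedged guess at a DataFrame-native equivalent, assuming Spark 1.3+ where approxCountDistinct lives in org.apache.spark.sql.functions (df and the column names are examples):

    import org.apache.spark.sql.functions.approxCountDistinct
    val distinctPerKey = df.groupBy("key").agg(approxCountDistinct("value").as("approx_distinct"))
    distinctPerKey.show()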

Re: How to solve ThreadException in Apache Spark standalone Java Application

2015-07-14 Thread Yana Kadiyska
Have you seen this SO thread: http://stackoverflow.com/questions/13471519/running-daemon-with-exec-maven-plugin This seems to be more related to the plugin than Spark, looking at the stack trace On Tue, Jul 14, 2015 at 8:11 AM, Hafsa Asif wrote: > I m still looking forward for the answer. I

Re: Spark on Tomcat has exception IncompatibleClassChangeError: Implementing class

2015-07-13 Thread Yana Kadiyska
Oh, this is very interesting -- can you explain about your dependencies -- I'm running Tomcat 7 and ended up using spark-assembly from WEB_INF/lib and removing the javax/servlet package out of it...but it's a pain in the neck. If I'm reading your first message correctly you use hadoop common and sp

Re: java.io.InvalidClassException

2015-07-13 Thread Yana Kadiyska
t: Row): Validator = { > > var check1: Boolean = if (input.getDouble(shortsale_in_pos) > > 140.0) true else false > > if (check1) this else Nomatch > > } > > } > > > > Saif > > > > *From:* Yana Kadiyska [mailto:yana.kadiy...@gmail.co

Re: java.io.InvalidClassException

2015-07-13 Thread Yana Kadiyska
It's a bit hard to tell from the snippets of code but it's likely related to the fact that when you serialize instances the enclosing class, if any, also gets serialized, as well as any other place where fields used in the closure come from...e.g.check this discussion: http://stackoverflow.com/ques

Re: SparkSQL 'describe table' tries to look at all records

2015-07-13 Thread Yana Kadiyska
Have you seen https://issues.apache.org/jira/browse/SPARK-6910? I opened https://issues.apache.org/jira/browse/SPARK-6984, which I think is related to this as well. There are a bunch of issues attached to it but basically yes, Spark interactions with a large metastore are bad...very bad if your me

Re: [SparkSQL] Incorrect ROLLUP results

2015-07-09 Thread Yana Kadiyska
| 32| 0| | 1| 32| 1| +---+---+---+ ​ On Thu, Jul 9, 2015 at 11:54 AM, ayan guha wrote: > Can you please post result of show()? > On 10 Jul 2015 01:00, "Yana Kadiyska" wrote: > >> Hi folks, I just re-wrote a query from using UNION ALL to use "with >> ro

[SparkSQL] Incorrect ROLLUP results

2015-07-09 Thread Yana Kadiyska
Hi folks, I just re-wrote a query from using UNION ALL to use "with rollup" and I'm seeing some unexpected behavior. I'll open a JIRA if needed but wanted to check if this is user error. Here is my code: case class KeyValue(key: Int, value: String) val df = sc.parallelize(1 to 50).map(i=>KeyValue(

How to debug java.io.OptionalDataException issues

2015-07-06 Thread Yana Kadiyska
Hi folks, suffering from a pretty strange issue: Is there a way to tell what object is being successfully serialized/deserialized? I have a maven-installed jar that works well when fat jarred within another, but shows the following stack when marked as provided and copied to the runtime classpath.

Difference between spark-defaults.conf and SparkConf.set

2015-06-30 Thread Yana Kadiyska
Hi folks, running into a pretty strange issue: I'm setting spark.executor.extraClassPath spark.driver.extraClassPath to point to some external JARs. If I set them in spark-defaults.conf everything works perfectly. However, if I remove spark-defaults.conf and just create a SparkConf and call .set(

Re: Debugging Apache Spark clustered application from Eclipse

2015-06-25 Thread Yana Kadiyska
Pass that debug string to your executor like this: --conf spark.executor.extraJavaOptions="-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=7761". When your executor is launched it will listen for a debugger on port 7761. When you attach the Eclipse debugger, you need to have the IP
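
The same option can also be set from code rather than on the command line; a sketch, with the port purely an example:

    import org.apache.spark.SparkConf
    val conf = new SparkConf().set(
      "spark.executor.extraJavaOptions",
      "-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=7761")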

Re: Can Spark1.4 work with CDH4.6

2015-06-24 Thread Yana Kadiyska
try? > > Thanks > Best Regards > > On Wed, Jun 24, 2015 at 12:07 AM, Yana Kadiyska > wrote: > >> Hi folks, I have been using Spark against an external Metastore service >> which runs Hive with Cdh 4.6 >> >> In Spark 1.2, I was able to successfully connect

Re: Spark stream test throw org.apache.spark.SparkException: Task not serializable when execute in spark shell

2015-06-24 Thread Yana Kadiyska
I can't tell immediately, but you might be able to get more info with the hint provided here: http://stackoverflow.com/questions/27980781/spark-task-not-serializable-with-simple-accumulator (short version, set -Dsun.io.serialization.extendedDebugInfo=true) Also, unless you're simplifying your exam

Can Spark1.4 work with CDH4.6

2015-06-23 Thread Yana Kadiyska
Hi folks, I have been using Spark against an external Metastore service which runs Hive with Cdh 4.6 In Spark 1.2, I was able to successfully connect by building with the following: ./make-distribution.sh --tgz -Dhadoop.version=2.0.0-mr1-cdh4.2.0 -Phive-thriftserver -Phive-0.12.0 I see that in S

Re: spark sql and cassandra. spark generate 769 tasks to read 3 lines from cassandra table

2015-06-17 Thread Yana Kadiyska
Can you show some code how you're doing the reads? Have you successfully read other stuff from Cassandra (i.e. do you have a lot of experience with this path and this particular table is causing issues or are you trying to figure out the right way to do a read). What version of Spark and Cassandra

Re: DataFrame insertIntoJDBC parallelism while writing data into a DB table

2015-06-16 Thread Yana Kadiyska
When all else fails look at the source ;) Looks like createJDBCTable is deprecated, but otherwise goes to the same implementation as insertIntoJDBC... https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala You can also look at DataFrameWriter in t
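
For the DataFrameWriter path, a sketch assuming Spark 1.4+ (URL, credentials and table name are placeholders, and df is the DataFrame being saved):

    import java.util.Properties
    val props = new Properties()
    props.setProperty("user", "dbuser")
    props.setProperty("password", "dbpass")
    // appends df's rows to an existing table over JDBC
    df.write.mode("append").jdbc("jdbc:mysql://dbhost:3306/mydb", "target_table", props)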

ClassNotFound exception from closure

2015-06-16 Thread Yana Kadiyska
Hi folks, running into a pretty strange issue -- I have a ClassNotFound exception from a closure?! My code looks like this: val jRdd1 = table.map(cassRow=>{ val lst = List(cassRow.get[Option[Any]](0),cassRow.get[Option[Any]](1)) Row.fromSeq(lst) }) println(s"This one worked .

Re: Reopen Jira or New Jira

2015-06-11 Thread Yana Kadiyska
John, I took the liberty of reopening because I have sufficient JIRA permissions (not sure if you do). It would be good if you can add relevant comments/investigations there. On Thu, Jun 11, 2015 at 8:34 AM, John Omernik wrote: > Hey all, from my other post on Spark 1.3.1 issues, I think we foun

Re: Cassandra Submit

2015-06-10 Thread Yana Kadiyska
; ... 28 more >> >> >> >> The Spark Cassandra Connector is trying to use a method, which does not >> exists. That means your assembly jar has the wrong version of the library >> that SCC is trying to use. Welcome to jar hell! >> >> &

Re: Cassandra Submit

2015-06-09 Thread Yana Kadiyska
my code from datastax spark-cassandra-connector > <https://github.com/datastax/spark-cassandra-connector/blob/master/spark-cassandra-connector-demos/simple-demos/src/main/java/com/datastax/spark/connector/demo/JavaApiDemo.java> > . > > Thanx alot. > yasemin > > 2015-06

Re: Cassandra Submit

2015-06-09 Thread Yana Kadiyska
160 >> >> I check the port "nc -z localhost 9160; echo $?" it returns me "0". I >> think it close, should I open this port ? >> >> 2015-06-09 16:55 GMT+03:00 Yana Kadiyska : >> >>> Is your cassandra installation actually listeni

Re: Cassandra Submit

2015-06-09 Thread Yana Kadiyska
"spark.cassandra.connection.host", "localhost") >> .set("spark.cassandra.connection.rpc.port", "9160"); >> >> whatever I write setting, I get same exception. Any help ?? >> >> >> 2015-06-08 18:23 GMT+03:00 Yana Kadiyska : >

Re: Cassandra Submit

2015-06-08 Thread Yana Kadiyska
yes, whatever you put for listen_address in cassandra.yaml. Also, you should try to connect to your cassandra cluster via bin/cqlsh to make sure you have connectivity before you try to make a a connection via spark. On Mon, Jun 8, 2015 at 4:43 AM, Yasemin Kaya wrote: > Hi, > I run my project on

Re: build jar with all dependencies

2015-06-02 Thread Yana Kadiyska
n compile my app to run this without -Dconfig.file=alt_ > reference1.conf? > > 2015-06-02 15:43 GMT+02:00 Yana Kadiyska : > >> This looks like your app is not finding your Typesafe config. The config >> should usually be placed in particular folder under your app to be seen &

Re: build jar with all dependencies

2015-06-02 Thread Yana Kadiyska
This looks like your app is not finding your Typesafe config. The config should usually be placed in particular folder under your app to be seen correctly. If it's in a non-standard location you can pass -Dconfig.file=alt_reference1.conf to java to tell it where to look. If this is a config that b
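
A hedged sketch of that suggestion: pass -Dconfig.file when launching, and let ConfigFactory.load() pick it up (the path and key are examples):

    // launch, for example: spark-submit --driver-java-options "-Dconfig.file=/etc/myapp/alt_reference1.conf" ...
    import com.typesafe.config.ConfigFactory
    val config = ConfigFactory.load()   // honours -Dconfig.file when the property is set
    val endpoint = config.getString("myapp.endpoint")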

Re: Compute Median in Spark Dataframe

2015-06-02 Thread Yana Kadiyska
Like this...sqlContext should be a HiveContext instance case class KeyValue(key: Int, value: String) val df=sc.parallelize(1 to 50).map(i=>KeyValue(i, i.toString)).toDF df.registerTempTable("table") sqlContext.sql("select percentile(key,0.5) from table").show() ​ On Tue, Jun 2, 2015 at 8:07 AM,
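
For completeness, a hedged variant of the same idea: Hive's percentile UDAF expects an integral column, so for doubles percentile_approx is the usual substitute.

    sqlContext.sql("select percentile_approx(cast(key as double), 0.5) from table").show()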

Re: Execption writing on two cassandra tables NoHostAvailableException: All host(s) tried for query failed (no host was tried)

2015-05-29 Thread Yana Kadiyska
are you able to connect to your cassandra installation via cassandra_home$ ./bin/cqlsh This exception generally means that your cassandra instance is not reachable/accessible On Fri, May 29, 2015 at 6:11 AM, Antonio Giambanco wrote: > Hi all, > I have in a single server installed spark 1.3.1 a

Re: Intermittent difficulties for Worker to contact Master on same machine in standalone

2015-05-27 Thread Yana Kadiyska
t; 15/05/27 11:37:37 WARN SparkDeploySchedulerBackend: Application ID is not >> initialized yet. >> 1 >> >> >> Even when successful, the time for the Master to come up has a >> surprisingly high variance. I am running on a single machine for which >> there is

Re: hive external metastore connection timeout

2015-05-27 Thread Yana Kadiyska
I have not run into this particular issue but I'm not using latest bits in production. However, testing your theory should be easy -- MySQL is just a database, so you should be able to use a regular mysql client and see how many connections are active. You can then compare to the maximum allowed co

Need some Cassandra integration help

2015-05-26 Thread Yana Kadiyska
Hi folks, for those of you working with Cassandra, wondering if anyone has been successful processing a mix of Cassandra and hdfs data. I have a dataset which is stored partially in HDFS and partially in Cassandra (schema is the same in both places) I am trying to do the following: val dfHDFS = s

Re: spark.executor.extraClassPath - Values not picked up by executors

2015-05-22 Thread Yana Kadiyska
Todd, I don't have any answers for you...other than the file is actually named spark-defaults.conf (not sure if you made a typo in the email or misnamed the file...). Do any other options from that file get read? I also wanted to ask if you built the spark-cassandra-connector-assembly-1.3 .0-SNAPS

Re: Unable to use hive queries with constants in predicates

2015-05-21 Thread Yana Kadiyska
I have not seen this error but have seen another user have weird parser issues before: http://mail-archives.us.apache.org/mod_mbox/spark-user/201501.mbox/%3ccag6lhyed_no6qrutwsxeenrbqjuuzvqtbpxwx4z-gndqoj3...@mail.gmail.com%3E I would attach a debugger and see what is going on -- if I'm looking a

Re: Storing data in MySQL from spark hive tables

2015-05-20 Thread Yana Kadiyska
I'm afraid you misunderstand the purpose of hive-site.xml. It configures access to the Hive metastore. You can read more here: http://www.hadoopmaterial.com/2013/11/metastore.html. So the MySQL DB in hive-site.xml would be used to store hive-specific data such as schema info, partition info, etc.

Re: Intermittent difficulties for Worker to contact Master on same machine in standalone

2015-05-20 Thread Yana Kadiyska
But if I'm reading his email correctly he's saying that: 1. The master and slave are on the same box (so network hiccups are unlikely culprit) 2. The failures are intermittent -- i.e program works for a while then worker gets disassociated... Is it possible that the master restarted? We used to h

Re: store hive metastore on persistent store

2015-05-16 Thread Yana Kadiyska
er all (still if I go to spark-shell and try to print out the SQL >> settings that I put in hive-site.xml, it does not print them). >> >> >> On Fri, May 15, 2015 at 7:22 PM, Yana Kadiyska >> wrote: >> >>> My point was more to how to verify that propert

Re: store hive metastore on persistent store

2015-05-15 Thread Yana Kadiyska
59:33 INFO HiveMetaStore: Added admin role in metastore > 15/05/15 17:59:34 INFO HiveMetaStore: Added public role in metastore > 15/05/15 17:59:34 INFO HiveMetaStore: No user is added in admin role, > since config is empty > 15/05/15 17:59:35 INFO SessionState: No Tez session required at this &

Re: SPARK-4412 regressed?

2015-05-15 Thread Yana Kadiyska
Thanks Sean, with the added permissions I do now have this extra option. On Fri, May 15, 2015 at 11:20 AM, Sean Owen wrote: > (I made you a Contributor in JIRA -- your yahoo-related account of the > two -- so maybe that will let you do so.) > > On Fri, May 15, 2015 at 4:19 PM, Y

SPARK-4412 regressed?

2015-05-15 Thread Yana Kadiyska
Hi, two questions 1. Can regular JIRA users reopen bugs -- I can open a new issue but it does not appear that I can reopen issues. What is the proper protocol to follow if we discover regressions? 2. I believe SPARK-4412 regressed in Spark 1.3.1, according to this SO thread possibly even in 1.3.0

Re: store hive metastore on persistent store

2015-05-15 Thread Yana Kadiyska
This should work. Which version of Spark are you using? Here is what I do -- make sure hive-site.xml is in the conf directory of the machine you're using the driver from. Now let's run spark-shell from that machine: scala> val hc= new org.apache.spark.sql.hive.HiveContext(sc) hc: org.apache.spark.
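
Continuing that check, a small sketch: if hive-site.xml is being picked up, tables from the shared metastore should be listed here rather than only a fresh local metastore_db.

    val hc = new org.apache.spark.sql.hive.HiveContext(sc)
    hc.sql("show tables").collect().foreach(println)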

[SparkSQL] Partition Autodiscovery (Spark 1.3)

2015-05-12 Thread Yana Kadiyska
Hi folks, I'm trying to use automatic partition discovery as described here: https://databricks.com/blog/2015/03/24/spark-sql-graduates-from-alpha-in-spark-1-3.html /data/year=2014/file.parquet/data/year=2015/file.parquet … SELECT * FROM table WHERE year = 2015 I have an official 1.3.1 CDH4 bui

Re: Spark 1.3.1 and Parquet Partitions

2015-05-07 Thread Yana Kadiyska
Here is the JIRA: https://issues.apache.org/jira/browse/SPARK-3928 Looks like for now you'd have to list the full paths...I don't see a comment from an official spark committer so still not sure if this is a bug or design, but it seems to be the current state of affairs. On Thu, May 7, 2015 at 8:

Escaping user input for Hive queries

2015-05-05 Thread Yana Kadiyska
Hi folks, we have been using the a JDBC connection to Spark's Thrift Server so far and using JDBC prepared statements to escape potentially malicious user input. I am trying to port our code directly to HiveContext now (i.e. eliminate the use of Thrift Server) and I am not quite sure how to genera

[ThriftServer] Urgent -- very slow Metastore query from Spark

2015-04-16 Thread Yana Kadiyska
Hi Sparkers, hoping for insight here: running a simple describe mytable here where mytable is a partitioned Hive table. Spark produces the following times: Query 1 of 1, Rows read: 50, Elapsed time (seconds) - Total: 73.02, SQL query: 72.831, Reading results: 0.189 ​ Whereas Hive over the same

[SparkSQL; Thriftserver] Help tracking missing 5 minutes

2015-04-15 Thread Yana Kadiyska
Hi Spark users, trying to upgrade to Spark 1.2 and running into the following: seeing some very slow queries and wondering if someone can point me in the right direction for debugging. My Spark UI shows a job with duration 15s (see attached screenshot). Which would be great, but client-side measurem

[ThriftServer] User permissions warning

2015-04-08 Thread Yana Kadiyska
Hi folks, I am noticing a pesky and persistent warning in my logs (this is from Spark 1.2.1): 15/04/08 15:23:05 WARN ShellBasedUnixGroupsMapping: got exception trying to get groups for user anonymous org.apache.hadoop.util.Shell$ExitCodeException: id: anonymous: No such user at org.apach

Re: Spark Avarage

2015-04-06 Thread Yana Kadiyska
If you're going to do it this way, I would output dayOfdate.substring(0,7), i.e. the month part, and instead of weatherCond, you can use (month,(minDeg,maxDeg,meanDeg)) -- i.e. a PairRDD. So weathersRDD: RDD[(String,(Double,Double,Double))]. Then use a reduceByKey as shown in multiple Spark examples..Y
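
A minimal sketch of that reduceByKey shape (the sample data and field meanings are assumptions based on the thread): carry a count alongside the running sum so the mean can be finished at the end.

    val weathersRDD: org.apache.spark.rdd.RDD[(String, (Double, Double, Double))] =
      sc.parallelize(Seq(("2015-04", (3.0, 15.0, 9.0)), ("2015-04", (1.0, 11.0, 6.0))))
    val byMonth = weathersRDD
      .mapValues { case (minDeg, maxDeg, meanDeg) => (minDeg, maxDeg, meanDeg, 1L) }
      .reduceByKey { case ((mn1, mx1, s1, n1), (mn2, mx2, s2, n2)) =>
        (math.min(mn1, mn2), math.max(mx1, mx2), s1 + s2, n1 + n2) }
      .mapValues { case (mn, mx, s, n) => (mn, mx, s / n) }
    byMonth.collect().foreach(println)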

DataFrame -- help with encoding factor variables

2015-04-06 Thread Yana Kadiyska
Hi folks, currently have a DF that has a factor variable -- say gender. I am hoping to use the RandomForest algorithm on this data and it appears that this needs to be converted to RDD[LabeledPoint] first -- i.e. all features need to be double-encoded. I see https://issues.apache.org/jira/browse/S

[SQL] Simple DataFrame questions

2015-04-02 Thread Yana Kadiyska
Hi folks, having some seemingly noob issues with the dataframe API. I have a DF which came from the csv package. 1. What would be an easy way to cast a column to a given type -- my DF columns are all typed as strings coming from a csv. I see a schema getter but not setter on DF 2. I am trying to
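
On question 1, a hedged sketch of the usual approach: re-select the columns, casting the ones that should not stay strings (the column names are illustrative and df is the CSV-backed DataFrame):

    import org.apache.spark.sql.functions.col
    import org.apache.spark.sql.types.{DoubleType, IntegerType}
    val typed = df.select(
      col("name"),
      col("age").cast(IntegerType).as("age"),
      col("score").cast(DoubleType).as("score"))
    typed.printSchema()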

Re: Is it possible to use windows service to start and stop spark standalone cluster

2015-03-11 Thread Yana Kadiyska
You might also want to see if TaskScheduler helps with that. I have not used it with Windows 2008 R2 but it generally does allow you to schedule a bat file to run on startup On Wed, Mar 11, 2015 at 10:16 AM, Wang, Ningjun (LNG-NPV) < ningjun.w...@lexisnexis.com> wrote: > > Thanks for the suggestio

Re: Errors in spark

2015-02-27 Thread Yana Kadiyska
e: > Hi yana, > > I have removed hive-site.xml from spark/conf directory but still getting > the same errors. Anyother way to work around. > > Regards, > Sandeep > > On Fri, Feb 27, 2015 at 9:38 PM, Yana Kadiyska > wrote: > >> I think you're mixin

Re: Errors in spark

2015-02-27 Thread Yana Kadiyska
I think you're mixing two things: the docs say "When* not *configured by the hive-site.xml, the context automatically creates metastore_db and warehouse in the current directory.". AFAIK if you want a local metastore, you don't put hive-site.xml anywhere. You only need the file if you're going to p

Re: Help me understand the partition, parallelism in Spark

2015-02-26 Thread Yana Kadiyska
Yong, for the 200 tasks in stage 2 and 3 -- this actually comes from the shuffle setting: spark.sql.shuffle.partitions On Thu, Feb 26, 2015 at 5:51 PM, java8964 wrote: > Imran, thanks for your explaining about the parallelism. That is very > helpful. > > In my test case, I am only use one box cl
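
A quick illustration of adjusting that knob from a SQLContext (the value is only an example):

    sqlContext.setConf("spark.sql.shuffle.partitions", "64")
    // or from SQL: sqlContext.sql("SET spark.sql.shuffle.partitions=64")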

Re: Help me understand the partition, parallelism in Spark

2015-02-26 Thread Yana Kadiyska
Imran, I have also observed the phenomenon of reducing the cores helping with OOM. I wanted to ask this (hopefully without straying off topic): we can specify the number of cores and the executor memory. But we don't get to specify _how_ the cores are spread among executors. Is it possible that wi

[SparkSQL, Spark 1.2] UDFs in group by broken?

2015-02-26 Thread Yana Kadiyska
Can someone confirm if they can run UDFs in group by in Spark 1.2? I have two builds running -- one from a custom build from early December (commit 4259ca8dd12) which works fine, and Spark 1.2-RC2. On the latter I get: jdbc:hive2://XXX.208:10001> select from_unixtime(epoch,'yyyy-MM-dd-HH'),count(

Re: Running multiple threads with same Spark Context

2015-02-25 Thread Yana Kadiyska
after setting the property > "spark.scheduler.mode" to FAIR. But the result is same as previous. Are > there any other properties that have to be set? > > > On Tue, Feb 24, 2015 at 10:26 PM, Yana Kadiyska > wrote: > >> It's hard to tell. I have not run

[SparkSQL] Number of map tasks in SparkSQL

2015-02-24 Thread Yana Kadiyska
Shark used to have shark.map.tasks variable. Is there an equivalent for Spark SQL? We are trying a scenario with heavily partitioned Hive tables. We end up with a UnionRDD with a lot of partitions underneath and hence too many tasks: https://github.com/apache/spark/blob/master/sql/hive/src/main/sc

Re: Running multiple threads with same Spark Context

2015-02-24 Thread Yana Kadiyska
It's hard to tell. I have not run this on EC2 but this worked for me: The only thing that I can think of is that the scheduling mode is set to - *Scheduling Mode:* FAIR val pool: ExecutorService = Executors.newFixedThreadPool(poolSize) while_loop to get curr_job pool.execute(new ReportJ
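
A rough, self-contained sketch of that pattern (pool size and job body are illustrative); FAIR mode only changes how concurrently submitted jobs share resources -- the concurrency itself comes from submitting jobs on separate threads:

    import java.util.concurrent.{Executors, ExecutorService, TimeUnit}
    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf().setAppName("concurrent-reports").set("spark.scheduler.mode", "FAIR")
    val sc = new SparkContext(conf)
    val pool: ExecutorService = Executors.newFixedThreadPool(4)
    (1 to 8).foreach { i =>
      pool.execute(new Runnable {
        override def run(): Unit = {
          // every "report" shares the one SparkContext; jobs run concurrently
          val n = sc.parallelize(1 to 1000000).filter(_ % (i + 1) == 0).count()
          println(s"report $i -> $n rows")
        }
      })
    }
    pool.shutdown()
    pool.awaitTermination(10, TimeUnit.MINUTES)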

Re: Executor size and checkpoints

2015-02-24 Thread Yana Kadiyska
gt; restart, and the new config took affect. Maybe. :) > > TD > > On Sat, Feb 21, 2015 at 7:30 PM, Yana Kadiyska > wrote: > >> Hi all, >> >> I had a streaming application and midway through things decided to up the >> executor memory. I spent a long t

Executor size and checkpoints

2015-02-21 Thread Yana Kadiyska
Hi all, I had a streaming application and midway through things I decided to up the executor memory. I spent a long time launching like this: ~/spark-1.2.0-bin-cdh4/bin/spark-submit --class StreamingTest --executor-memory 2G --master... and observing that the executor memory was still at the old 512 setting

textFile partitions

2015-02-09 Thread Yana Kadiyska
Hi folks, puzzled by something pretty simple: I have a standalone cluster with default parallelism of 2, spark-shell running with 2 cores sc.textFile("README.md").partitions.size returns 2 (this makes sense) sc.textFile("README.md").coalesce(100,true).partitions.size returns 100, also makes sense
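
A related note in code form: textFile's second argument requests a minimum number of splits up front, which avoids the extra coalesce(..., shuffle = true) step (the value is illustrative):

    sc.textFile("README.md", 100).partitions.size  // asks for at least 100 input splits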

Re: spark-shell has syntax error on windows.

2015-01-23 Thread Yana Kadiyska
wrote: > Do you mind filing a JIRA issue for this which includes the actual error > message string that you saw? https://issues.apache.org/jira/browse/SPARK > > On Thu, Jan 22, 2015 at 8:31 AM, Yana Kadiyska > wrote: > >> I am not sure if you get the same exception as I do

Re: Exception: NoSuchMethodError: org.apache.spark.streaming.StreamingContext$.toPairDStreamFunctions

2015-01-23 Thread Yana Kadiyska
if you're running the test via sbt you can examine the classpath that sbt uses for the test (show runtime:full-classpath or last run)-- I find this helps once too many includes and excludes interact. On Thu, Jan 22, 2015 at 3:50 PM, Adrian Mocanu wrote: > > I use spark 1.1.0-SNAPSHOT and the tes

Re: Results never return to driver | Spark Custom Reader

2015-01-23 Thread Yana Kadiyska
It looks to me like your executor actually crashed and didn't just finish properly. Can you check the executor log? It is available in the UI, or on the worker machine, under $SPARK_HOME/work/app-20150123155114-/6/stderr (unless you manually changed the work directory location but in that

Re: Installing Spark Standalone to a Cluster

2015-01-22 Thread Yana Kadiyska
You can do ./sbin/start-slave.sh --master spark://IP:PORT. I believe you're missing --master. In addition, it's a good idea to pass with --master exactly the spark master's endpoint as shown on your UI under http://localhost:8080. But that should do it. If that's not working, you can look at the Wo

Re: spark-shell has syntax error on windows.

2015-01-22 Thread Yana Kadiyska
I am not sure if you get the same exception as I do -- spark-shell2.cmd works fine for me. Windows 7 as well. I've never bothered looking to fix it as it seems spark-shell just calls spark-shell2 anyway... On Thu, Jan 22, 2015 at 3:16 AM, Vladimir Protsenko wrote: > I have a problem with running

Re: [SparkSQL] Try2: Parquet predicate pushdown troubles

2015-01-21 Thread Yana Kadiyska
o put > parquet.task.side.metadata config into Hadoop core-site.xml, and then > re-run the query. I can see significant differences by doing so. > > I’ll open a JIRA and deliver a fix for this ASAP. Thanks again for > reporting all the details! > > Cheng > > On 1/13/15 12:56 P

Re: Why custom parquet format hive table execute "ParquetTableScan" physical plan, not "HiveTableScan"?

2015-01-20 Thread Yana Kadiyska
o update if you figure this out! On Mon, Jan 19, 2015 at 8:02 PM, Xiaoyu Wang wrote: > The spark.sql.parquet.filterPushdown=true has been turned on. But set > spark.sql.hive.convertMetastoreParquet to false. the first > parameter then loses its effect!!! > > 2015-01-20 6:52

Re: Why custom parquet format hive table execute "ParquetTableScan" physical plan, not "HiveTableScan"?

2015-01-19 Thread Yana Kadiyska
If you're talking about filter pushdowns for parquet files, this also has to be turned on explicitly. Try spark.sql.parquet.filterPushdown=true. It's off by default. On Mon, Jan 19, 2015 at 3:46 AM, Xiaoyu Wang wrote: > Yes it works! > But the filter can't pushdown!!! > > If custom parquetin
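
For reference, a sketch of setting both properties discussed in this thread from code (whether convertMetastoreParquet should stay true depends on the reader's table setup):

    sqlContext.setConf("spark.sql.parquet.filterPushdown", "true")
    sqlContext.setConf("spark.sql.hive.convertMetastoreParquet", "true")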
