Setting log level to DEBUG while keeping httpclient.wire on WARN

2018-06-29 Thread Daniel Haviv
Hi, I'm trying to debug an issue with Spark so I've set the log level to DEBUG, but at the same time I'd like to avoid httpclient.wire's verbose output by setting it to WARN. I tried the following log4j.properties config but I'm still getting DEBUG output for httpclient.wire:
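A minimal log4j.properties sketch of the intended setup (the exact wire-logger names are an assumption; depending on the HTTP client bundled with your Spark build the logger may be httpclient.wire or org.apache.http.wire):

```properties
# Root logger at DEBUG, writing to the console
log4j.rootCategory=DEBUG, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n

# Silence only the HTTP wire logging
log4j.logger.httpclient.wire=WARN
log4j.logger.org.apache.http.wire=WARN
log4j.logger.org.apache.http=WARN
```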

Thrift server not exposing temp tables (spark.sql.hive.thriftServer.singleSession=true)

2018-05-30 Thread Daniel Haviv
Hi, I would like to expose a DF through the Thrift server, but even though I enable spark.sql.hive.thriftServer.singleSession I still can't see the temp table. I'm using Spark 2.2.0: spark-shell --conf spark.sql.hive.thriftServer.singleSession=true import
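A hedged sketch of one way to make a temp view reachable over JDBC: register the view and start the Thrift server from within the same session (the view name and data below are placeholders):

```scala
// Run inside: spark-shell --conf spark.sql.hive.thriftServer.singleSession=true
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

val df = Seq(("a", 1), ("b", 2)).toDF("name", "value")   // placeholder data
df.createOrReplaceTempView("my_temp_view")

// Expose this session's catalog (including the temp view) over JDBC on port 10000
HiveThriftServer2.startWithContext(spark.sqlContext)
```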

Writing a UDF that works with an Interval in PySpark

2017-12-11 Thread Daniel Haviv
Hi, I'm trying to write a variant of date_add that accepts an interval as a second parameter so that I could use the following syntax with SparkSQL: select date_add(cast('1970-01-01' as date), interval 1 day) but I'm getting the following error: ValueError: (ValueError(u'Could not parse datatype:

Re: UDF issues with spark

2017-12-10 Thread Daniel Haviv
Some code would help to debug the issue On Fri, 8 Dec 2017 at 21:54 Afshin, Bardia < bardia.afs...@changehealthcare.com> wrote: > Using pyspark cli on spark 2.1.1 I’m getting out of memory issues when > running the udf function on a recordset count of 10 with a mapping of the > same value

Writing custom Structured Streaming receiver

2017-11-01 Thread Daniel Haviv
Hi, Is there a guide to writing a custom Structured Streaming receiver? Thank you. Daniel

Optimizing dataset joins

2017-05-18 Thread Daniel Haviv
Hi, With RDDs it was possible to define a partitioner for two RDDs and, given that the two RDDs had the same partitioner, a join operation would be performed local to the partition without shuffling. Can dataset joins be optimized in the same way? Is it enough to repartition two datasets on the
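For Datasets, the closest analogue to co-partitioned RDD joins is bucketing both sides on the join key (or repartitioning both on it within the same job). A sketch with hypothetical data and table names:

```scala
import spark.implicits._

val dfA = spark.range(0, 1000).select($"id".as("key"), ($"id" * 2).as("a"))
val dfB = spark.range(0, 1000).select($"id".as("key"), ($"id" * 3).as("b"))

// Write both sides bucketed and sorted on the join key
dfA.write.bucketBy(64, "key").sortBy("key").saveAsTable("a_bucketed")
dfB.write.bucketBy(64, "key").sortBy("key").saveAsTable("b_bucketed")

// With matching bucket counts the join plan should show no Exchange (shuffle) step
spark.table("a_bucketed").join(spark.table("b_bucketed"), "key").explain()
```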

[Spark-SQL] Hive support is required to select over the following tables

2017-02-08 Thread Daniel Haviv
Hi, I'm using Spark 2.1.0 on Zeppelin. I can successfully create a table but when I try to select from it I fail: spark.sql("create table foo (name string)") res0: org.apache.spark.sql.DataFrame = [] spark.sql("select * from foo") org.apache.spark.sql.AnalysisException: Hive support is required
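In Spark 2.x the usual fix for this error, assuming the spark-hive module is available on the classpath (which may not be the case in every Zeppelin build), is to create the session with Hive support enabled:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("hive-enabled")
  .enableHiveSupport()          // requires the spark-hive module on the classpath
  .getOrCreate()

spark.sql("create table foo (name string)")
spark.sql("select * from foo").show()
```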

Re: Not per-key state in spark streaming

2016-12-08 Thread Daniel Haviv
There's no need to extend Spark's API, look at mapWithState for examples. On Thu, Dec 8, 2016 at 4:49 AM, Anty Rao wrote: > > > On Wed, Dec 7, 2016 at 7:42 PM, Anty Rao wrote: > >> Hi >> I'm new to Spark. I'm doing some research to see if spark streaming
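For reference, a minimal mapWithState sketch (a hypothetical running word count over a socket stream, not code from this thread):

```scala
import org.apache.spark.streaming.{Seconds, State, StateSpec, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(10))
ssc.checkpoint("/tmp/checkpoint")                 // mapWithState requires a checkpoint directory

// Hypothetical input: word pairs from a socket stream
val pairs = ssc.socketTextStream("localhost", 9999).flatMap(_.split(" ")).map((_, 1))

// Keep a running total per key; the state survives across batches
val spec = StateSpec.function((word: String, one: Option[Int], state: State[Long]) => {
  val total = state.getOption.getOrElse(0L) + one.getOrElse(0)
  state.update(total)
  (word, total)
})

pairs.mapWithState(spec).print()
ssc.start()
```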

Re: Not per-key state in spark streaming

2016-12-07 Thread Daniel Haviv
Hi Anty, What you could do is keep in the state only the existence of a key and when necessary pull it from a secondary state store like HDFS or HBASE. Daniel On Wed, Dec 7, 2016 at 1:42 PM, Anty Rao wrote: > Hi > I'm new to Spark. I'm doing some research to see if spark

Re: How to load only the data of the last partition

2016-11-17 Thread Daniel Haviv
Hi Samy, If you're working with Hive you could create a partitioned table and update its partitions' locations to the latest version, so that when you query it using Spark you'll always get the latest version. Daniel On Thu, Nov 17, 2016 at 9:05 PM, Samy Dindane wrote: > Hi, > >
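A hedged sketch of the approach described above, run through Hive or a Hive-enabled Spark session (table, partition and path names are placeholders):

```scala
// Point a fixed "current" partition at the directory of the latest version,
// then always query through that partition.
spark.sql("""
  ALTER TABLE events PARTITION (version='current')
  SET LOCATION 'hdfs:///data/events/v42'
""")
spark.sql("SELECT * FROM events WHERE version = 'current'").show()
```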

Using mapWithState without a checkpoint

2016-11-17 Thread Daniel Haviv
Hi, Is it possible to use mapWithState without checkpointing at all ? I'd rather have the whole application fail, restart and reload an initialState RDD than pay for checkpointing every 10 batches. Thank you, Daniel

mapWithState job slows down & exceeds yarn's memory limits

2016-11-14 Thread Daniel Haviv
Hi, I have a fairly simple stateful streaming job that suffers from high GC, and its executors are killed as they exceed the size of the requested container. My current executor memory is 10G, the Spark overhead is 2G, and it's running with one core. At first the job begins running at a rate

Re: mapWithState with Datasets

2016-11-08 Thread Daniel Haviv
Scratch that, it's working fine. Thank you. On Tue, Nov 8, 2016 at 8:19 PM, Daniel Haviv < daniel.ha...@veracity-group.com> wrote: > Hi, > I should have used transform instead of map > > val x: DStream[(String, Record)] = > kafkaStream.transform(x=>{x

Re: mapWithState with Datasets

2016-11-08 Thread Daniel Haviv
ue, Nov 8, 2016 at 7:46 PM, Daniel Haviv < daniel.ha...@veracity-group.com> wrote: > Hi, > I'm trying to make a stateful stream of Tuple2[String, Dataset[Record]] : > > val kafkaStream = KafkaUtils.createDirectStream[String, String, > StringDecoder, StringDecoder](ssc

mapWithState with Datasets

2016-11-08 Thread Daniel Haviv
Hi, I'm trying to make a stateful stream of Tuple2[String, Dataset[Record]] : val kafkaStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topicSet) val stateStream: DStream[RDD[(String, Record)]] = kafkaStream.map(x=> {

mapWithState with a big initial RDD gets OOM'ed

2016-11-07 Thread Daniel Haviv
Hi, I have a stateful streaming app where I pass a rather large initialState RDD at the beginning. No matter how many partitions I divide the stateful stream into, I keep failing with OOM or Java heap space errors. Is there a way to make it more resilient? How can I control its storage level? This is

mapWithState and DataFrames

2016-11-06 Thread Daniel Haviv
Hi, How can I utilize mapWithState and DataFrames? Right now I stream json messages from Kafka, update their state, output the updated state as json and compose a dataframe from it. It seems inefficient both in terms of processing and storage (a long string instead of a compact DF). Is there a

Re: Stream compressed data from KafkaUtils.createDirectStream

2016-11-03 Thread Daniel Haviv
Hi Baki, It's enough for the producer to write the messages compressed. See here: https://cwiki.apache.org/confluence/display/KAFKA/Compression Thank you. Daniel > On 3 Nov 2016, at 21:27, Daniel Haviv <daniel.ha...@veracity-group.com> wrote: > > Hi, > Kafka can compre

Re: Stream compressed data from KafkaUtils.createDirectStream

2016-11-03 Thread Daniel Haviv
Hi, Kafka can compress/uncompress your messages for you seamlessly, adding compression on top of that will be redundant. Thank you. Daniel > On 3 Nov 2016, at 20:53, bhayat wrote: > > Hello, > > I really wonder that whether i can stream compressed data with using >
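A minimal producer-side sketch (broker address and topic are placeholders); compression is configured on the producer only, and consumers, including Spark's direct stream, receive the payloads transparently decompressed:

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("compression.type", "snappy")   // compression is a producer-only concern

val producer = new KafkaProducer[String, String](props)
producer.send(new ProducerRecord("my-topic", "key", "value"))
producer.close()
```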

error: Unable to find encoder for type stored in a Dataset. when trying to map through a DataFrame

2016-11-02 Thread Daniel Haviv
Hi, I have the following scenario: scala> val df = spark.sql("select * from danieltest3") df: org.apache.spark.sql.DataFrame = [iid: string, activity: string ... 34 more fields] Now I'm trying to map through the rows I'm getting: scala> df.map(r=>r.toSeq) :32: error: Unable to find encoder for
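A hedged workaround for the error above: map into a type Spark can encode (for example a case class over the needed columns, with spark.implicits._ in scope) rather than the untyped Seq[Any] that a Row produces:

```scala
import spark.implicits._                      // encoders for case classes and primitives

// Column names taken from the schema shown in the question
case class Slim(iid: String, activity: String)

val slim = df.select("iid", "activity").as[Slim]
slim.map(r => r.iid).show()                   // the typed map now has an encoder
```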

partitionBy produces wrong number of tasks

2016-10-19 Thread Daniel Haviv
Hi, I have a case where I use partitionBy to write my DF using a calculated column, so it looks somethings like this: val df = spark.sql("select *, from_unixtime(ts, 'MMddH') partition_key from mytable") df.write.partitionBy("partition_key").orc("/partitioned_table") df is 8 partitions in
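A commonly used sketch for this situation: repartition by the same column before the write, so each output partition value is written by roughly one task instead of numInputPartitions × numDistinctValues small files:

```scala
// df as defined above
df.repartition(df("partition_key"))
  .write
  .partitionBy("partition_key")
  .orc("/partitioned_table")
```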

Using DirectOutputCommitter with ORC

2016-07-25 Thread Daniel Haviv
Hi, How can the DirectOutputCommitter be utilized for writing ORC files? I tried setting it via: sc.getConf.set("spark.hadoop.mapred.output.committer.class","com.veracity-group.datastage.directorcwriter") But I can still see a _temporary directory being used when I save my dataframe as ORC.

Spark (on Windows) not picking up HADOOP_CONF_DIR

2016-07-17 Thread Daniel Haviv
Hi, I'm running Spark using IntelliJ on Windows and even though I set HADOOP_CONF_DIR it does not affect the contents of sc.hadoopConfiguration. Anybody encountered it ? Thanks, Daniel
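A hedged workaround when the environment variable isn't visible to the IDE-launched JVM (paths are placeholders): add the site files to the configuration explicitly, or set HADOOP_CONF_DIR in the run configuration's environment.

```scala
import org.apache.hadoop.fs.Path

// Explicitly load the Hadoop client configuration files
sc.hadoopConfiguration.addResource(new Path("C:/hadoop/conf/core-site.xml"))
sc.hadoopConfiguration.addResource(new Path("C:/hadoop/conf/hdfs-site.xml"))
```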

Re: Spark Streaming - Best Practices to handle multiple datapoints arriving at different time interval

2016-07-16 Thread Daniel Haviv
Or use mapWithState Thank you. Daniel > On 16 Jul 2016, at 03:02, RK Aduri wrote: > > You can probably define sliding windows and set larger batch intervals. > > > > -- > View this message in context: >

Spark Streaming: Refreshing broadcast value after each batch

2016-07-12 Thread Daniel Haviv
Hi, I have a streaming application which uses a broadcast variable that I populate from a database. Every once in a while (or even every batch) I would like to update/replace the broadcast variable with the latest data from the database. The only way I've found online to do this is this "hackish" way (
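For reference, the pattern usually cited for this (a sketch, not code from the thread; loadFromDatabase() is a hypothetical helper): rebuild the broadcast on the driver inside foreachRDD whenever a refresh interval has elapsed.

```scala
// Assumes: ssc is the StreamingContext, dstream the input DStream,
// and loadFromDatabase() a hypothetical function returning the lookup data.
var lookup = ssc.sparkContext.broadcast(loadFromDatabase())
var lastRefresh = System.currentTimeMillis()

dstream.foreachRDD { rdd =>
  if (System.currentTimeMillis() - lastRefresh > 60000) {   // refresh every minute
    lookup.unpersist(blocking = false)                      // drop the old copies
    lookup = rdd.sparkContext.broadcast(loadFromDatabase()) // re-broadcast fresh data
    lastRefresh = System.currentTimeMillis()
  }
  val current = lookup                                      // capture the reference for the closure
  rdd.foreach { record =>
    val data = current.value                                // use the broadcast data with each record
  }
}
```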

Confusion regarding sc.accumulableCollection(mutable.ArrayBuffer[String]()) type

2016-06-23 Thread Daniel Haviv
Hi, I want to use an accumulableCollection of type mutable.ArrayBuffer[String] to return invalid records detected during transformations, but I don't quite get its type: val errorAccumulator: Accumulable[ArrayBuffer[String], String] = sc.accumulableCollection(mutable.ArrayBuffer[String]())
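For what it's worth, the two type parameters are the accumulated collection and the element type accepted by +=; a small self-contained sketch:

```scala
import scala.collection.mutable

// Accumulable[ArrayBuffer[String], String]: ArrayBuffer[String] is what the driver reads back,
// String is what each task may add with +=
val errorAccumulator = sc.accumulableCollection(mutable.ArrayBuffer[String]())

sc.parallelize(1 to 10).foreach { i =>
  if (i % 3 == 0) errorAccumulator += s"invalid record $i"
}
println(errorAccumulator.value)   // the collected messages, readable on the driver only
```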

Re: Switching broadcast mechanism from torrent

2016-06-20 Thread Daniel Haviv
> > On Sun, Jun 19, 2016 at 7:10 AM, Takeshi Yamamuro <linguin@gmail.com> > wrote: > >> How about using `transient` annotations? >> >> // maropu >> >> On Sun, Jun 19, 2016 at 10:51 PM, Daniel Haviv < >> daniel.ha...@veracity-group.

Re: Switching broadcast mechanism from torrent

2016-06-19 Thread Daniel Haviv
> Anyway, I couldn't find a root cause only from the stacktraces... > > // maropu > > > > > On Mon, Jun 6, 2016 at 2:14 AM, Daniel Haviv < > daniel.ha...@veracity-group.com> wrote: > >> Hi, >> I've set spark.broadcast.factory to >> org.ap

Re: HiveContext: Unable to load AWS credentials from any provider in the chain

2016-06-10 Thread Daniel Haviv
I'm using EC2 instances Thank you. Daniel > On 9 Jun 2016, at 16:49, Gourav Sengupta <gourav.sengu...@gmail.com> wrote: > > Hi, > > are you using EC2 instances or local cluster behind firewall. > > > Regards, > Gourav Sengupta > >> On

Re: HiveContext: Unable to load AWS credentials from any provider in the chain

2016-06-08 Thread Daniel Haviv
Hi, I've set these properties both in core-site.xml and hdfs-site.xml with no luck. Thank you. Daniel > On 9 Jun 2016, at 01:11, Steve Loughran <ste...@hortonworks.com> wrote: > > >> On 8 Jun 2016, at 16:34, Daniel Haviv <daniel.ha...@veracity-group.com> >>

HiveContext: Unable to load AWS credentials from any provider in the chain

2016-06-08 Thread Daniel Haviv
Hi, I'm trying to create a table on s3a but I keep hitting the following error: Exception in thread "main" org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:com.cloudera.com.amazonaws.AmazonClientException: Unable to load AWS credentials from any provider in the chain) I
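One hedged way to rule out configuration propagation is to set the s3a credentials directly on the Hadoop configuration (key values and the table location are placeholders):

```scala
// Set the credentials programmatically so both Spark and the Hive metastore client see them
sc.hadoopConfiguration.set("fs.s3a.access.key", "<ACCESS_KEY>")
sc.hadoopConfiguration.set("fs.s3a.secret.key", "<SECRET_KEY>")

// Then create the table on the s3a location
sqlContext.sql("create table t (c string) location 's3a://my-bucket/t'")
```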

groupByKey returns an emptyRDD

2016-06-06 Thread Daniel Haviv
Hi, I've wrapped the following code into a jar: val test = sc.parallelize(Seq(("daniel", "a"), ("daniel", "b"), ("test", "1)"))) val agg = test.groupByKey() agg.collect.foreach(r=>{println(r._1)}) The result of groupByKey is an empty RDD; when I try the same code using the spark-shell it's

Re: Switching broadcast mechanism from torrent

2016-06-06 Thread Daniel Haviv
parameter to switch broadcast > method. > > Can you describe the issues with torrent broadcast in more detail ? > > Which version of Spark are you using ? > > Thanks > > On Wed, Jun 1, 2016 at 7:48 AM, Daniel Haviv < > daniel.ha...@veracity-group.com> wrote: > &g

Switching broadcast mechanism from torrent

2016-06-01 Thread Daniel Haviv
Hi, Our application is failing due to issues with the torrent broadcast, is there a way to switch to another broadcast method ? Thank you. Daniel

Re: "collecting" DStream data

2016-05-15 Thread Daniel Haviv
eption. Also, you can’t update the > value of a broadcast variable, since it’s immutable. > > Thanks, > Silvio > > From: Daniel Haviv <daniel.ha...@veracity-group.com> > Date: Sunday, May 15, 2016 at 6:23 AM > To: user <user@spark.apache.org> > Subject: &qu

"collecting" DStream data

2016-05-15 Thread Daniel Haviv
Hi, I have a DStream whose values I'd like to collect and broadcast. To do so I've created a mutable HashMap which I'm filling with foreachRDD, but when I check it, it remains empty. If I use an ArrayBuffer it works as expected. This is my code: val arr =

java.lang.ClassCastException: org.apache.spark.util.SerializableConfiguration cannot be cast to [B

2016-05-11 Thread Daniel Haviv
Hi, I'm running a very simple job (textFile->map->groupby->count) with Spark 1.6.0 on EMR 4.3 (Hadoop 2.7.1) and I'm hitting this exception when running in yarn-client mode but not in local mode: 16/05/11 10:29:26 INFO TaskSetManager: Starting task 0.1 in stage 0.0 (TID 1,

Re: Is it a bug?

2016-05-09 Thread Daniel Haviv
How come that for the first() function it calculates an updated value and for collect it doesn't ? On Sun, May 8, 2016 at 4:17 PM, Ted Yu wrote: > I don't think so. > RDD is immutable. > > > On May 8, 2016, at 2:14 AM, Sisyphuss wrote: > > > >

Re: Save DataFrame to HBase

2016-04-27 Thread Daniel Haviv
lugin to work? I have CDH 5.5.2 installed which > comes with HBase 1.0.0 and Phoenix 4.5.2. Do you think this will work? > > Thanks, > Ben > >> On Apr 24, 2016, at 1:43 AM, Daniel Haviv <daniel.ha...@veracity-group.com> >> wrote: >> >> Hi, >>

Re: Save DataFrame to HBase

2016-04-24 Thread Daniel Haviv
Hi, I tried saving a DF to HBase using a Hive table with the HBase storage handler and hiveContext, but it failed due to a bug. I was able to persist the DF to HBase using Apache Phoenix, which was pretty simple. Thank you. Daniel > On 21 Apr 2016, at 16:52, Benjamin Kim wrote: >
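A sketch of the Phoenix route mentioned above, using the phoenix-spark connector (the table name and ZooKeeper URL are placeholders):

```scala
// The DataFrame's columns must match the Phoenix table's columns (including the primary key)
df.write
  .format("org.apache.phoenix.spark")
  .mode("overwrite")                       // phoenix-spark requires SaveMode.Overwrite
  .option("table", "MY_TABLE")
  .option("zkUrl", "zookeeper-host:2181")
  .save()
```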

Spark fileStream from a partitioned hive dir

2016-04-13 Thread Daniel Haviv
Hi, We have a Hive table which gets data written to it by two partition keys, day and hour. We would like to stream the incoming files, but since fileStream can only listen on one directory, we start a streaming job on the latest partition and every hour kill it and start a new one on a newer

Re: aggregateByKey on PairRDD

2016-03-30 Thread Daniel Haviv
Hi, shouldn't groupByKey be avoided ( https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/best_practices/prefer_reducebykey_over_groupbykey.html)? Thank you. Daniel On Wed, Mar 30, 2016 at 9:01 AM, Akhil Das wrote: > Isn't it what
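The linked guidance in one small example: reduceByKey (like aggregateByKey) combines values map-side before the shuffle, whereas groupByKey ships every value across the network.

```scala
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3), ("b", 4)))

val summedEfficiently = pairs.reduceByKey(_ + _)            // partial sums computed before the shuffle
val summedViaGroup    = pairs.groupByKey().mapValues(_.sum) // all values shuffled, then summed

summedEfficiently.collect().foreach(println)
```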

Flume with Spark Streaming Sink

2016-03-20 Thread Daniel Haviv
Hi, I'm trying to use the Spark Sink with Flume but it seems I'm missing some of the dependencies. I'm running the following code: ./bin/spark-shell --master yarn --jars

One task hangs and never finishes

2015-12-17 Thread Daniel Haviv
Hi, I have an application that runs a set of transformations and finishes with saveAsTextFile. Out of 80 tasks, all finish pretty fast except one that just hangs and outputs these messages to STDERR: 15/12/17 17:22:19 INFO collection.ExternalAppendOnlyMap: Thread 82 spilling in-memory map of 4.0 GB to

Re: HiveContext ignores ("skip.header.line.count"="1")

2015-10-26 Thread Daniel Haviv
I will. Thank you. > On 27 Oct 2015, at 4:54, Felix Cheung wrote: > > Please open a JIRA? > > > Date: Mon, 26 Oct 2015 15:32:42 +0200 > Subject: HiveContext ignores ("skip.header.line.count"="1") > From: daniel.ha...@veracity-group.com > To: user@spark.apache.org

HiveContext ignores ("skip.header.line.count"="1")

2015-10-26 Thread Daniel Haviv
Hi, I have a csv table in Hive which is configured to skip the header row using TBLPROPERTIES("skip.header.line.count"="1"). When querying from Hive the header row is not included in the data, but when running the same query via HiveContext I get the header row. I made sure that HiveContext sees

Generated ORC files cause NPE in Hive

2015-10-13 Thread Daniel Haviv
Hi, We are inserting streaming data into a hive orc table via a simple insert statement passed to HiveContext. When trying to read the files generated using Hive 1.2.1 we are getting NPE: at org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.processRow(MapRecordSource.java:91) at

Re: SQLContext within foreachRDD

2015-10-12 Thread Daniel Haviv
indow on the Dstream) - and it works as expected. > > Check this out for code example: > https://github.com/databricks/reference-apps/blob/master/logs_analyzer/chapter1/scala/src/main/scala/com/databricks/apps/logs/chapter1/LogAnalyzerStreamingSQL.scala > > -adrian > > From: Daniel Havi

SQLContext within foreachRDD

2015-10-12 Thread Daniel Haviv
Hi, As things that run inside foreachRDD run at the driver, does that mean that if we use SQLContext inside foreachRDD the data is sent back to the driver and only then the query is executed or is it executed at the executors? Thank you. Daniel
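A small sketch of the point in question (a hypothetical DStream of strings, with sqlContext in scope): the foreachRDD body runs on the driver, but it only builds the query plan; the query itself still executes on the executors because the DataFrame wraps the distributed RDD.

```scala
// lines: DStream[String] (assumed)
lines.foreachRDD { rdd =>
  import sqlContext.implicits._
  val df = rdd.toDF("value")                              // driver-side plan construction only
  df.registerTempTable("batch")
  sqlContext.sql("select count(*) from batch").show()     // the count runs on the executors
}
```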

Re: Insert via HiveContext is slow

2015-10-09 Thread Daniel Haviv
out soon. > > > > Hao > > > > *From:* Daniel Haviv [mailto:daniel.ha...@veracity-group.com] > *Sent:* Friday, October 9, 2015 3:08 AM > *To:* user > *Subject:* Re: Insert via HiveContext is slow > > > > Forgot to mention that my insert is a multi table insert

Re: Insert via HiveContext is slow

2015-10-08 Thread Daniel Haviv
microreflection, dsParamLine['unerroreds'] unerroreds, dsParamLine['corrected'] corrected, dsParamLine['uncorrectables'] uncorrectables, from_unixtime(cast(datats/1000 as bigint),'MMdd') day_ts, cmtsid """) On Thu,

Insert via HiveContext is slow

2015-10-08 Thread Daniel Haviv
Hi, I'm inserting into a partitioned ORC table using an insert sql statement passed via HiveContext. The performance I'm getting is pretty bad and I was wondering if there are ways to speed things up. Would saving the DF like this

Converting a DStream to schemaRDD

2015-09-29 Thread Daniel Haviv
Hi, I have a DStream which is a stream of RDD[String]. How can I pass a DStream to sqlContext.jsonRDD and work with it as a DF ? Thank you. Daniel
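A hedged sketch (Spark 1.x API, sqlContext assumed in scope): apply jsonRDD to each batch inside foreachRDD (or transform) to get a DataFrame per micro-batch.

```scala
// jsonStream: DStream[String] of JSON documents (assumed)
jsonStream.foreachRDD { rdd =>
  if (!rdd.isEmpty()) {
    val df = sqlContext.jsonRDD(rdd)          // infer the schema for this batch
    df.registerTempTable("events")
    sqlContext.sql("select count(*) from events").show()
  }
}
```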

Re: java.io.NotSerializableException: org.apache.avro.Schema$RecordSchema

2015-09-25 Thread Daniel Haviv
I tried but I'm getting the same error (task not serializable) > On 25 Sep 2015, at 20:10, Ted Yu <yuzhih...@gmail.com> wrote: > > Is the Schema.parse() call expensive ? > > Can you call it in the closure ? > >> On Fri, Sep 25, 2015 at 10:06 AM, Daniel Ha

java.io.NotSerializableException: org.apache.avro.Schema$RecordSchema

2015-09-25 Thread Daniel Haviv
Hi, I'm getting a NotSerializableException even though I'm creating all my objects from within the closure: import org.apache.avro.generic.GenericDatumReader import java.io.File import org.apache.avro._ val orig_schema = Schema.parse(new File("/home/wasabi/schema")) val READER = new
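One hedged way around this class of error: keep only the schema string in the closure and build the Schema and reader once per partition, so the non-serializable objects never have to be shipped (byteRdd below is a hypothetical RDD of Avro-encoded records):

```scala
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericDatumReader, GenericRecord}
import org.apache.avro.io.DecoderFactory

val schemaString = scala.io.Source.fromFile("/home/wasabi/schema").mkString  // a plain String serializes fine

// byteRdd: RDD[Array[Byte]] (assumed)
val decoded = byteRdd.mapPartitions { iter =>
  val schema = new Schema.Parser().parse(schemaString)            // built on the executor
  val reader = new GenericDatumReader[GenericRecord](schema)
  iter.map { bytes =>
    val decoder = DecoderFactory.get().binaryDecoder(bytes, null)
    reader.read(null, decoder).toString
  }
}
```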

Reading avro data using KafkaUtils.createDirectStream

2015-09-24 Thread Daniel Haviv
Hi, I'm trying to use KafkaUtils.createDirectStream to read avro messages from Kafka but something is off with my type arguments: val avroStream = KafkaUtils.createDirectStream[AvroKey[GenericRecord], GenericRecord, NullWritable, AvroInputFormat[GenericRecord]](ssc, kafkaParams, topicSet) I'm

Re: spark-avro takes a lot time to load thousands of files

2015-09-22 Thread Daniel Haviv
tty inefficient on almost any file system > > El martes, 22 de septiembre de 2015, Daniel Haviv > <daniel.ha...@veracity-group.com> escribió: >> Hi, >> We are trying to load around 10k avro files (each file holds only one >> record) using spark-avro but it take

spark-avro takes a lot time to load thousands of files

2015-09-22 Thread Daniel Haviv
Hi, We are trying to load around 10k Avro files (each file holds only one record) using spark-avro, but it takes over 15 minutes to load. It seems that most of the work is being done at the driver, where a broadcast variable is created for each file. Any idea why it's behaving that way? Thank

sqlContext.read.avro broadcasting files from the driver

2015-09-21 Thread Daniel Haviv
Hi, I'm loading a 1000 files using the spark-avro package: val df = sqlContext.read.avro(*"/incoming/"*) When I'm performing an action on this df it seems like for each file a broadcast is being created and is sent to the workers (instead of the workers reading their data-local files): scala>

Re: Spark Thrift Server JDBC Drivers

2015-09-17 Thread Daniel Haviv
l work outside of EMR, but it's worth a try. > > I've also used the ODBC driver from Hortonworks > <http://hortonworks.com/products/releases/hdp-2-2/#add_ons>. > > Regards, > Dan > > On Wed, Sep 16, 2015 at 8:34 AM, Daniel Haviv < > daniel.ha...@veracity-group.com> w

Spark Thrift Server JDBC Drivers

2015-09-16 Thread Daniel Haviv
Hi, are there any free JDBC drivers for the Thrift server? The only ones I could find are Simba's, which require a license. Thanks, Daniel
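For what it's worth, the Spark Thrift server speaks the HiveServer2 protocol, so the open-source Hive JDBC driver (org.apache.hive:hive-jdbc) can be used; a minimal sketch with placeholder host and credentials:

```scala
import java.sql.DriverManager

Class.forName("org.apache.hive.jdbc.HiveDriver")
val conn = DriverManager.getConnection("jdbc:hive2://thrift-host:10000/default", "user", "")
val rs = conn.createStatement().executeQuery("show tables")
while (rs.next()) println(rs.getString(1))
conn.close()
```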

Re: Parsing Avro from Kafka Message

2015-09-04 Thread Daniel Haviv
eDirectStream[AvroKey[GenericRecord], > NullWritable, AvroKeyInputFormat[GenericRecord]](..) > > val avroData = avroStream.map(x => x._1.datum().toString) > > > Thanks > Best Regards > >> On Thu, Sep 3, 2015 at 6:17 PM, Daniel Haviv >> <daniel.ha...@verac

Parsing Avro from Kafka Message

2015-09-03 Thread Daniel Haviv
Hi, I'm reading messages from Kafka where the value is an avro file. I would like to parse the contents of the message and work with it as a DataFrame, like with the spark-avro package but instead of files, pass it a RDD. How can this be achieved ? Thank you. Daniel

Starting a service with Spark Executors

2015-08-09 Thread Daniel Haviv
Hi, I'd like to start a service with each Spark executor upon initialization and have the distributed code reference that service locally. What I'm trying to do is invoke torch7 computations without reloading the model for each row, by starting a Lua HTTP handler that will receive HTTP requests for

Re: Starting Spark SQL thrift server from within a streaming app

2015-08-06 Thread Daniel Haviv
) :: (2,world) ::Nil).toDF.cache().registerTempTable(t) HiveThriftServer2.startWithContext(sqlContext) } Again, I'm not really clear what your use case is, but it does sound like the first link above is what you may want. -Todd On Wed, Aug 5, 2015 at 1:57 PM, Daniel Haviv daniel.ha

Starting Spark SQL thrift server from within a streaming app

2015-08-05 Thread Daniel Haviv
Hi, Is it possible to start the Spark SQL Thrift server from within a streaming app so the streamed data can be queried as it comes in? Thank you. Daniel

Local Repartition

2015-07-20 Thread Daniel Haviv
Hi, My data is constructed from a lot of small files which results in a lot of partitions per RDD. Is there some way to locally repartition the RDD without shuffling so that all of the partitions that reside on a specific node will become X partitions on the same node ? Thank you. Daniel

Re: Local Repartition

2015-07-20 Thread Daniel Haviv
executors * 10, but I’m still trying to figure out the optimal number of partitions per executor. To get the number of executors, sc.getConf.getInt(“spark.executor.instances”,-1) Cheers, Doug On Jul 20, 2015, at 5:04 AM, Daniel Haviv daniel.ha...@veracity-group.com wrote: Hi, My data

Re: Local Repartition

2015-07-20 Thread Daniel Haviv
to be aware of that as you decide when to call coalesce. Thanks, Silvio From: Daniel Haviv Date: Monday, July 20, 2015 at 4:59 PM To: Doug Balog Cc: user Subject: Re: Local Repartition Thanks Doug, coalesce might invoke a shuffle as well. I don't think what I'm suggesting

DataFrame insertInto fails, saveAsTable works (Azure HDInsight)

2015-07-09 Thread Daniel Haviv
Hi, I'm running Spark 1.4 on Azure. DataFrame's insertInto fails, but saveAsTable works. It seems like some issue with accessing Azure's blob storage, but that doesn't explain why one type of write works and the other doesn't. This is the stack trace: Caused by:

Re: thrift-server does not load jars files (Azure HDInsight)

2015-07-08 Thread Daniel Haviv
...@hortonworks.com wrote: On Thu, Jul 2, 2015 at 7:38 AM, Daniel Haviv daniel.ha...@veracity-group.com wrote: Hi, I'm trying to start the thrift-server and passing it azure's blob storage jars but I'm failing on : Caused by: java.io.IOException: No FileSystem for scheme: wasb

Re: Starting Spark without automatically starting HiveContext

2015-07-03 Thread Daniel Haviv
. On 3 Jul 2015, at 08:33, Daniel Haviv daniel.ha...@veracity-group.com wrote: Thanks I was looking for a less hack-ish way :) Daniel On Fri, Jul 3, 2015 at 10:15 AM, Akhil Das ak...@sigmoidanalytics.com wrote: With binary i think it might not be possible, although if you can download

Re: Starting Spark without automatically starting HiveContext

2015-07-03 Thread Daniel Haviv
/apache/spark/blob/master/repl/scala-2.10/src/main/scala/org/apache/spark/repl/SparkILoop.scala#L1023 which initializes the SQLContext. Thanks Best Regards On Thu, Jul 2, 2015 at 6:11 PM, Daniel Haviv daniel.ha...@veracity-group.com wrote: Hi, I've downloaded the pre-built binaries

Starting Spark without automatically starting HiveContext

2015-07-02 Thread Daniel Haviv
Hi, I've downloaded the pre-built binaries for Hadoop 2.6 and whenever I start the spark-shell it always starts with a HiveContext. How can I prevent the HiveContext from being initialized automatically? Thanks, Daniel

thrift-server does not load jars files (Azure HDInsight)

2015-07-02 Thread Daniel Haviv
Hi, I'm trying to start the thrift-server and passing it azure's blob storage jars but I'm failing on : Caused by: java.io.IOException: No FileSystem for scheme: wasb at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2584) at

Re: thrift-server does not load jars files (Azure HDInsight)

2015-07-02 Thread Daniel Haviv
at 7:38 AM, Daniel Haviv daniel.ha...@veracity-group.com wrote: Hi, I'm trying to start the thrift-server and passing it azure's blob storage jars but I'm failing on : Caused by: java.io.IOException: No FileSystem for scheme: wasb at org.apache.hadoop.fs.FileSystem.getFileSystemClass

Re: Using Spark on Azure Blob Storage

2015-06-25 Thread Daniel Haviv
by step guide in how to setup and use Spark in HDInsight. https://azure.microsoft.com/en-us/documentation/articles/hdinsight-hadoop-spark-install/ Jacob From: Daniel Haviv [mailto:daniel.ha...@veracity-group.com] Sent: Thursday, June 25, 2015 3:19 PM To: Silvio Fiorito Cc: user

Using Spark on Azure Blob Storage

2015-06-25 Thread Daniel Haviv
Hi, I'm trying to use spark over Azure's HDInsight but the spark-shell fails when starting: java.io.IOException: No FileSystem for scheme: wasb at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2584) at

Re: Using Spark on Azure Blob Storage

2015-06-25 Thread Daniel Haviv
library is up at http://spark-packages.org/package/granturing/spark-power-bi the Event Hubs library should be up soon. Thanks, Silvio From: Daniel Haviv Date: Thursday, June 25, 2015 at 1:37 PM To: user@spark.apache.org Subject: Using Spark on Azure Blob Storage Hi, I'm trying to use

Saving RDDs as custom output format

2015-04-14 Thread Daniel Haviv
Hi, Is it possible to store RDDs as custom output formats, For example ORC? Thanks, Daniel

Ensuring data locality when opening files

2015-03-09 Thread Daniel Haviv
Hi, We wrote a Spark streaming app that receives file names on HDFS from Kafka and opens them using Hadoop's libraries. The problem with this method is that I'm not utilizing data locality, because any worker might open any file without giving precedence to data locality. I can't open the files

using sparkContext from within a map function (from spark streaming app)

2015-03-08 Thread Daniel Haviv
Hi, We are designing a solution which pulls file paths from Kafka and for the current stage just counts the lines in each of these files. When running the code it fails on: Exception in thread main org.apache.spark.SparkException: Task not serializable at

Re: Why Parquet Predicate Pushdown doesn't work?

2015-01-06 Thread Daniel Haviv
Quoting Michael: Predicate push down into the input format is turned off by default because there is a bug in the current Parquet library that causes null pointer exceptions when there are full row groups that are null. https://issues.apache.org/jira/browse/SPARK-4258 You can turn it on if you want:

Re: Unable to start Spark 1.3 after building:java.lang. NoClassDefFoundError: org/codehaus/jackson/map/deser/std/StdDeserializer

2014-12-17 Thread Daniel Haviv
problem.. 2014-12-09 22:58 GMT+08:00 Daniel Haviv danielru...@gmail.com: Hi, I've built spark 1.3 with hadoop 2.6 but when I startup the spark-shell I get the following exception: 14/12/09 06:54:24 INFO server.AbstractConnector: Started SelectChannelConnector@0.0.0.0:4040 14/12/09

Could not find the main class: org.apache.spark.deploy.SparkSubmit

2014-12-16 Thread Daniel Haviv
Hi, I've built spark successfully with maven but when I try to run spark-shell I get the following errors: Spark assembly has been built with Hive, including Datanucleus jars on classpath Exception in thread main java.lang.NoClassDefFoundError: org/apache/spark/deploy/SparkSubmit Caused by:

Re: Could not find the main class: org.apache.spark.deploy.SparkSubmit

2014-12-16 Thread Daniel Haviv
On Tue, Dec 16, 2014 at 5:04 PM, Daniel Haviv danielru...@gmail.com wrote: Hi, I've built spark successfully with maven but when I try to run spark-shell I get the following errors: Spark assembly has been built with Hive, including Datanucleus jars on classpath Exception in thread main

Re: Could not find the main class: org.apache.spark.deploy.SparkSubmit

2014-12-16 Thread Daniel Haviv
Best Regards On Tue, Dec 16, 2014 at 9:33 PM, Daniel Haviv danielru...@gmail.com wrote: That's the first thing I tried... still the same error: hdfs@ams-rsrv01:~$ export CLASSPATH=/tmp/spark/spark-branch-1.1/lib hdfs@ams-rsrv01:~$ cd /tmp/spark/spark-branch-1.1 hdfs@ams-rsrv01:/tmp/spark

Re: Could not find the main class: org.apache.spark.deploy.SparkSubmit

2014-12-16 Thread Daniel Haviv
/datanucleus-core-3.2.2.jar:/home/akhld/mobi/localcluster/spark-1/lib/datanucleus-rdbms-3.2.1.jar:/home/akhld/mobi/localcluster/spark-1/lib/datanucleus-api-jdo-3.2.1.jar Thanks Best Regards On Tue, Dec 16, 2014 at 10:33 PM, Daniel Haviv danielru...@gmail.com wrote: Same here... # jar tf lib

Re: Could not find the main class: org.apache.spark.deploy.SparkSubmit

2014-12-16 Thread Daniel Haviv
it and see what are the contents? On 16 Dec 2014 22:46, Daniel Haviv danielru...@gmail.com wrote: I've added every jar in the lib dir to my classpath and still no luck: CLASSPATH=/tmp/spark/spark-branch-1.1/lib/datanucleus-api-jdo-3.2.1.jar:/tmp/spark/spark-branch-1.1/lib/datanucleus-core

Re: Could not find the main class: org.apache.spark.deploy.SparkSubmit

2014-12-16 Thread Daniel Haviv
are the contents? On 16 Dec 2014 22:46, Daniel Haviv danielru...@gmail.com wrote: I've added every jar in the lib dir to my classpath and still no luck: CLASSPATH=/tmp/spark/spark-branch-1.1/lib/datanucleus-api-jdo-3.2.1.jar:/tmp/spark/spark-branch-1.1/lib/datanucleus-core-3.2.2.jar:/tmp/spark/spark

Unable to start Spark 1.3 after building:java.lang. NoClassDefFoundError: org/codehaus/jackson/map/deser/std/StdDeserializer

2014-12-09 Thread Daniel Haviv
Hi, I've built spark 1.3 with hadoop 2.6 but when I startup the spark-shell I get the following exception: 14/12/09 06:54:24 INFO server.AbstractConnector: Started SelectChannelConnector@0.0.0.0:4040 14/12/09 06:54:24 INFO util.Utils: Successfully started service 'SparkUI' on port 4040. 14/12/09

Starting the thrift server

2014-11-26 Thread Daniel Haviv
Hi, I'm trying to start the thrift server but failing: Exception in thread main java.lang.NoClassDefFoundError: org/apache/tez/dag/api/SessionNotRunning at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:353) at

Re: Remapping columns from a schemaRDD

2014-11-26 Thread Daniel Haviv
are rows inside of rows. If you want to return a struct from a UDF you can do that with a case class. On Tue, Nov 25, 2014 at 10:25 AM, Daniel Haviv danielru...@gmail.com wrote: Thank you. How can I address more complex columns like maps and structs? Thanks again! Daniel On 25 Nov 2014

Re: RDD saveAsObjectFile write to local file and HDFS

2014-11-26 Thread Daniel Haviv
Prepend file:// to the path. Daniel On 26 Nov 2014, at 20:15, firemonk9 dhiraj.peech...@gmail.com wrote: When I am running spark locally, RDD saveAsObjectFile writes the file to the local file system (ex: path /data/temp.txt) and when I am running spark on a YARN cluster, RDD

Re: Unable to use Kryo

2014-11-25 Thread Daniel Haviv
The problem was that I didn't use the correct class name; it should be org.apache.spark.serializer.KryoSerializer On Mon, Nov 24, 2014 at 11:12 PM, Daniel Haviv danielru...@gmail.com wrote: Hi, I want to test Kryo serialization but when starting spark-shell I'm hitting the following error

Remapping columns from a schemaRDD

2014-11-25 Thread Daniel Haviv
Hi, I'm selecting columns from a json file, transform some of them and would like to store the result as a parquet file but I'm failing. This is what I'm doing: val jsonFiles=sqlContext.jsonFile(/requests.loading) jsonFiles.registerTempTable(jRequests) val clean_jRequests=sqlContext.sql(select

Re: Remapping columns from a schemaRDD

2014-11-25 Thread Daniel Haviv
pipeline work, we have been considering adding something like: def modifyColumn(columnName, function) Any comments anyone has on this interface would be appreciated! Michael On Tue, Nov 25, 2014 at 7:02 AM, Daniel Haviv danielru...@gmail.com wrote: Hi, I'm selecting columns from

Unable to use Kryo

2014-11-24 Thread Daniel Haviv
Hi, I want to test Kryo serialization but when starting spark-shell I'm hitting the following error: java.lang.ClassNotFoundException: org.apache.spark.KryoSerializer the kryo-2.21.jar is on the classpath so I'm not sure why it's not picking it up. Thanks for your help, Daniel

Converting a column to a map

2014-11-23 Thread Daniel Haviv
Hi, I have a column in my schemaRDD that is a map, but I'm unable to convert it to a Scala map. I've tried converting it to a Tuple2[String,String]: val converted = jsonFiles.map(line=> { line(10).asInstanceOf[Tuple2[String,String]]}) but I get a ClassCastException: 14/11/23 11:51:30 WARN

Merging Parquet Files

2014-11-22 Thread Daniel Haviv
Hi, I'm ingesting a lot of small JSON files and converting them to unified Parquet files, but even the unified files are fairly small (~10MB). I want to run a merge operation every hour on the existing files, but it takes a lot of time for such a small amount of data: about 3 GB spread over 3000
