Hi,
I'm loading a json file into an RDD and then saving that RDD as Parquet.
One of the fields is a map of keys and values but it is being translated and
stored as a struct.
How can I convert the field into a map?
Thanks,
Daniel
Hello,
I'm writing a process that ingests json files and saves them as parquet
files.
The process is as such:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val jsonRequests = sqlContext.jsonFile("/requests")
val parquetRequests = sqlContext.parquetFile("/requests_parquet")
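The snippet reads both paths but never shows the write step. A hedged sketch of the full read-then-save flow on the Spark 1.x SchemaRDD API (paths are the thread's examples):

```scala
// Spark 1.x API: infer a schema from the JSON files, write out as Parquet,
// then read the Parquet files back for later queries.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val jsonRequests = sqlContext.jsonFile("/requests")      // schema inferred from JSON
jsonRequests.saveAsParquetFile("/requests_parquet")      // persist as Parquet
val parquetRequests = sqlContext.parquetFile("/requests_parquet")
```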
Very cool thank you!
On Wed, Nov 19, 2014 at 11:15 AM, Marius Soutier mps@gmail.com wrote:
You can also insert into existing tables via .insertInto(tableName,
overwrite). You just have to import sqlContext._
On 19.11.2014, at 09:41, Daniel Haviv danielru...@gmail.com wrote:
Hello
mapper.registerModule(DefaultScalaModule)
val myMap = mapper.readValue[Map[String,String]](json)
myMap
})
Thanks
Best Regards
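The fragment above appears to come from a map over the raw text. A fuller hedged version of the same idea, creating the (non-serializable, costly) Jackson mapper once per partition rather than per record:

```scala
import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.module.scala.DefaultScalaModule
import com.fasterxml.jackson.module.scala.experimental.ScalaObjectMapper

// Parse each JSON line into a Scala Map[String, String].
// mapPartitions builds one ObjectMapper per partition instead of per record.
val maps = sc.textFile("/requests").mapPartitions { lines =>
  val mapper = new ObjectMapper() with ScalaObjectMapper
  mapper.registerModule(DefaultScalaModule)
  lines.map { json =>
    val myMap = mapper.readValue[Map[String, String]](json)
    myMap
  }
}
```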
On Wed, Nov 19, 2014 at 11:01 AM, Daniel Haviv danielru...@gmail.com
wrote:
Hi,
I'm loading a json file into a RDD and then save
Hello,
I'm loading and saving json files into an existing directory with parquet files
using the insertIntoTable method.
If the method fails for some reason (differences in the schema in my case), the
_metadata file of the parquet dir is automatically deleted, rendering the
existing parquet
You can save the results as a parquet file or as a text file and create a hive
table based on those files
Daniel
On 20 Nov 2014, at 08:01, akshayhazari akshayhaz...@gmail.com wrote:
Sorry about the confusion I created. I've just started learning this week.
Silly me, I was actually
Hi,
I'm loading a bunch of json files and there seem to be problems with
specific files (either schema changes or incomplete files).
I'd like to catch the inconsistent files but I'm not sure how to do it.
This is the exception I get:
14/11/20 00:13:49 INFO cluster.YarnClientClusterScheduler:
and not doing anything with
them
}
any ideas?
Thanks,
Daniel
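One hedged approach to the question above: pre-validate each line with `Try` before handing the RDD to Spark SQL, so truncated or malformed records are dropped instead of failing the job (Spark 1.x `jsonRDD` API assumed):

```scala
import scala.util.Try
import com.fasterxml.jackson.databind.ObjectMapper

// Keep only lines Jackson can parse; malformed/truncated records are
// filtered out (they could also be routed to a side output for inspection).
val raw = sc.textFile("/requests")
val valid = raw.mapPartitions { lines =>
  val mapper = new ObjectMapper()
  lines.filter(line => Try(mapper.readTree(line)).isSuccess)
}
val jsonRequests = sqlContext.jsonRDD(valid)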
On Thu, Nov 20, 2014 at 10:20 AM, Daniel Haviv danielru...@gmail.com
wrote:
Hi,
I'm loading a bunch of json files and there seem to be problems with
specific files (either schema changes or incomplete files).
I'd like to catch
Hi Guys,
I really need your help with this:
I'm loading a bunch of json files uploaded via webhdfs, some of them have
some inconsistencies (the json ends mid-line for example) and that causes my
whole application to fail.
How can I continue processing valid json records without failing the
will be null.
On Thu, Nov 20, 2014 at 7:34 AM, Daniel Haviv danielru...@gmail.com
wrote:
Hi Guys,
I really need your help with this:
I'm loading a bunch of json files uploaded via webhdfs, some of them have
some inconsistencies (the json ends mid-line for example) and that causes my
whole
There is probably a better way to do it, but I would register both as temp
tables and then join them via SQL.
BR,
Daniel
On 20 Nov 2014, at 23:53, Harihar Nahak hna...@wynyardgroup.com wrote:
I've a similar type of issue; I want to join two different types of RDD into one RDD
file1.txt
Hi,
Every time I start the spark-shell I encounter this message:
14/11/18 00:27:43 WARN hdfs.BlockReaderLocal: The short-circuit local reads
feature cannot be used because libhadoop cannot be loaded.
Any idea how to overcome it?
the short-circuit feature is a big performance boost I don't want to
Hi,
I'm ingesting a lot of small JSON files and convert them to unified parquet
files, but even the unified files are fairly small (~10MB).
I want to run a merge operation every hour on the existing files, but it
takes a lot of time for such a small amount of data: about 3 GB spread over
3000
Hi,
I have a column in my schemaRDD that is a map but I'm unable to convert it
to a map. I've tried converting it to a Tuple2[String,String]:
val converted = jsonFiles.map(line => {
  line(10).asInstanceOf[Tuple2[String,String]]})
but I get ClassCastException:
14/11/23 11:51:30 WARN
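The cast above fails because, in Spark SQL, a MapType column is materialized inside the Row as a Scala Map rather than a Tuple2. A hedged sketch of the fix (column index taken from the thread; in some 1.x versions the runtime type is `scala.collection.Map`):

```scala
// Cast the MapType column to a Scala Map, not a Tuple2:
val converted = jsonFiles.map { line =>
  line(10).asInstanceOf[scala.collection.Map[String, String]]
}
```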
Hi,
I want to test Kryo serialization but when starting spark-shell I'm hitting
the following error:
java.lang.ClassNotFoundException: org.apache.spark.KryoSerializer
the kryo-2.21.jar is on the classpath so I'm not sure why it's not picking
it up.
Thanks for your help,
Daniel
The problem was I didn't use the correct class name; it should
be org.apache.spark.serializer.KryoSerializer
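With the corrected class name, enabling Kryo looks like the following (a minimal sketch):

```scala
import org.apache.spark.SparkConf

// The serializer class lives under the `serializer` package:
val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
// equivalently on the command line:
//   spark-shell --conf spark.serializer=org.apache.spark.serializer.KryoSerializer
```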
On Mon, Nov 24, 2014 at 11:12 PM, Daniel Haviv danielru...@gmail.com
wrote:
Hi,
I want to test Kryo serialization but when starting spark-shell I'm
hitting the following error
Hi,
I'm selecting columns from a json file, transform some of them and would
like to store the result as a parquet file but I'm failing.
This is what I'm doing:
val jsonFiles = sqlContext.jsonFile("/requests.loading")
jsonFiles.registerTempTable("jRequests")
val clean_jRequests=sqlContext.sql(select
pipeline work, we have been considering
adding something like:
def modifyColumn(columnName, function)
Any comments anyone has on this interface would be appreciated!
Michael
On Tue, Nov 25, 2014 at 7:02 AM, Daniel Haviv danielru...@gmail.com wrote:
Hi,
I'm selecting columns from
Hi,
I'm trying to start the thrift server but failing:
Exception in thread "main" java.lang.NoClassDefFoundError:
org/apache/tez/dag/api/SessionNotRunning
at
org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:353)
at
are rows inside of rows. If you
want to return a struct from a UDF you can do that with a case class.
On Tue, Nov 25, 2014 at 10:25 AM, Daniel Haviv danielru...@gmail.com
wrote:
Thank you.
How can I address more complex columns like maps and structs?
Thanks again!
Daniel
On 25 Nov 2014
Prepend file:// to the path
Daniel
On 26 Nov 2014, at 20:15, firemonk9 dhiraj.peech...@gmail.com wrote:
When I am running spark locally, RDD saveAsObjectFile writes the file to
local file system (ex : path /data/temp.txt)
and
when I am running spark on YARN cluster, RDD
Hi,
I've built spark 1.3 with hadoop 2.6 but when I startup the spark-shell I
get the following exception:
14/12/09 06:54:24 INFO server.AbstractConnector: Started
SelectChannelConnector@0.0.0.0:4040
14/12/09 06:54:24 INFO util.Utils: Successfully started service 'SparkUI'
on port 4040.
14/12/09
Hi,
I've built spark successfully with maven but when I try to run spark-shell I
get the following errors:
Spark assembly has been built with Hive, including Datanucleus jars on classpath
Exception in thread "main" java.lang.NoClassDefFoundError:
org/apache/spark/deploy/SparkSubmit
Caused by:
On Tue, Dec 16, 2014 at 5:04 PM, Daniel Haviv danielru...@gmail.com
wrote:
Hi,
I've built spark successfully with maven but when I try to run
spark-shell I get the following errors:
Spark assembly has been built with Hive, including Datanucleus jars on
classpath
Exception in thread "main"
Best Regards
On Tue, Dec 16, 2014 at 9:33 PM, Daniel Haviv danielru...@gmail.com
wrote:
That's the first thing I tried... still the same error:
hdfs@ams-rsrv01:~$ export CLASSPATH=/tmp/spark/spark-branch-1.1/lib
hdfs@ams-rsrv01:~$ cd /tmp/spark/spark-branch-1.1
hdfs@ams-rsrv01:/tmp/spark
/datanucleus-core-3.2.2.jar:/home/akhld/mobi/localcluster/spark-1/lib/datanucleus-rdbms-3.2.1.jar:/home/akhld/mobi/localcluster/spark-1/lib/datanucleus-api-jdo-3.2.1.jar
Thanks
Best Regards
On Tue, Dec 16, 2014 at 10:33 PM, Daniel Haviv danielru...@gmail.com
wrote:
Same here...
# jar tf lib
it and see what are the contents?
On 16 Dec 2014 22:46, Daniel Haviv danielru...@gmail.com wrote:
I've added every jar in the lib dir to my classpath and still no luck:
CLASSPATH=/tmp/spark/spark-branch-1.1/lib/datanucleus-api-jdo-3.2.1.jar:/tmp/spark/spark-branch-1.1/lib/datanucleus-core
problem..
2014-12-09 22:58 GMT+08:00 Daniel Haviv danielru...@gmail.com:
Hi,
I've built spark 1.3 with hadoop 2.6 but when I startup the spark-shell I
get the following exception:
14/12/09 06:54:24 INFO server.AbstractConnector: Started
SelectChannelConnector@0.0.0.0:4040
14/12/09
Quoting Michael:
Predicate push down into the input format is turned off by default because
there is a bug in the current parquet library that throws null pointer exceptions
when there are full row groups that are null.
https://issues.apache.org/jira/browse/SPARK-4258
You can turn it on if you want:
Hi,
We wrote a spark streaming app that receives file names on HDFS from Kafka
and opens them using Hadoop's libraries.
The problem with this method is that we're not utilizing data locality,
because any worker might open any file without giving precedence to
locally-stored data.
I can't open the files
Hi,
We are designing a solution which pulls file paths from Kafka and for the
current stage just counts the lines in each of these files.
When running the code it fails on:
Exception in thread "main" org.apache.spark.SparkException: Task not
serializable
at
Hi,
Is it possible to store RDDs as custom output formats, for example ORC?
Thanks,
Daniel
by step guide in how to setup and use Spark in
HDInsight.
https://azure.microsoft.com/en-us/documentation/articles/hdinsight-hadoop-spark-install/
Jacob
From: Daniel Haviv [mailto:daniel.ha...@veracity-group.com]
Sent: Thursday, June 25, 2015 3:19 PM
To: Silvio Fiorito
Cc: user
Hi,
I'm trying to use spark over Azure's HDInsight but the spark-shell fails
when starting:
java.io.IOException: No FileSystem for scheme: wasb
at
org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2584)
at
library is up
at http://spark-packages.org/package/granturing/spark-power-bi the Event Hubs
library should be up soon.
Thanks,
Silvio
From: Daniel Haviv
Date: Thursday, June 25, 2015 at 1:37 PM
To: user@spark.apache.org
Subject: Using Spark on Azure Blob Storage
Hi,
I'm trying to use
Hi,
I've downloaded the pre-built binaries for Hadoop 2.6 and whenever I start
the spark-shell it always starts with a HiveContext.
How can I disable the HiveContext from being initialized automatically ?
Thanks,
Daniel
Hi,
I'm trying to start the thrift-server and passing it azure's blob storage
jars but I'm failing on:
Caused by: java.io.IOException: No FileSystem for scheme: wasb
at
org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2584)
at
.
On 3 Jul 2015, at 08:33, Daniel Haviv daniel.ha...@veracity-group.com
wrote:
Thanks
I was looking for a less hack-ish way :)
Daniel
On Fri, Jul 3, 2015 at 10:15 AM, Akhil Das ak...@sigmoidanalytics.com
wrote:
With binary I think it might not be possible, although if you can download
/apache/spark/blob/master/repl/scala-2.10/src/main/scala/org/apache/spark/repl/SparkILoop.scala#L1023
which initializes the SQLContext.
Thanks
Best Regards
On Thu, Jul 2, 2015 at 6:11 PM, Daniel Haviv
daniel.ha...@veracity-group.com wrote:
Hi,
I've downloaded the pre-built binaries
at 7:38 AM, Daniel Haviv
daniel.ha...@veracity-group.com wrote:
Hi,
I'm trying to start the thrift-server and passing it azure's blob storage
jars but I'm failing on:
Caused by: java.io.IOException: No FileSystem for scheme: wasb
at
org.apache.hadoop.fs.FileSystem.getFileSystemClass
Hi,
Is it possible to start the Spark SQL thrift server from within a streaming app
so the streamed data can be queried as it goes in?
Thank you.
Daniel
Hi,
I'd like to start a service with each Spark Executor upon initialization and
have the distributed code reference that service locally.
What I'm trying to do is to invoke torch7 computations without reloading
the model for each row, by starting a Lua http handler that will receive http
requests for
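A common pattern for this is a JVM-wide lazy singleton referenced from inside the tasks, so the service starts once per executor JVM rather than per row. A hedged sketch; `LocalTorchClient` and its `score` method are hypothetical placeholders for the actual torch7/HTTP details:

```scala
// Hypothetical thin client for the local Lua/torch7 HTTP handler.
class LocalTorchClient(host: String, port: Int) {
  def score(row: String): String = ??? // issue an HTTP request to the handler
}

// One-per-JVM lazy singleton: the lazy val is initialized on first access
// inside a task, i.e. once per executor JVM, never on the driver.
object TorchService {
  lazy val client: LocalTorchClient = {
    // launching the torch7 HTTP handler process would go here (hypothetical)
    new LocalTorchClient("localhost", 8080)
  }
}

// Given some rdd: RDD[String] of rows to score:
val scored = rdd.mapPartitions { rows =>
  val svc = TorchService.client // first access per executor starts the service
  rows.map(svc.score)
}
```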
) :: (2, "world")
:: Nil).toDF.cache().registerTempTable("t")
HiveThriftServer2.startWithContext(sqlContext)
}
Again, I'm not really clear what your use case is, but it does sound like
the first link above is what you may want.
-Todd
On Wed, Aug 5, 2015 at 1:57 PM, Daniel Haviv
daniel.ha
Hi,
My data is constructed from a lot of small files which results in a lot of
partitions per RDD.
Is there some way to locally repartition the RDD without shuffling, so that
all of the partitions that reside on a specific node will become X
partitions on the same node?
Thank you.
Daniel
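The closest built-in to what's asked here is `coalesce` with `shuffle = false`: it narrows many partitions into fewer without introducing a shuffle stage, though it prefers rather than guarantees node-local merges. A minimal sketch:

```scala
// coalesce(n, shuffle = false) builds each new partition from a set of
// parent partitions (preferring co-located ones), so no shuffle stage is
// scheduled - unlike repartition(), which always shuffles.
val manySmallPartitions = sc.textFile("/incoming/json")
val fewer = manySmallPartitions.coalesce(48, shuffle = false) // 48 is arbitrary
```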
executors * 10, but I’m still
trying to figure out the
optimal number of partitions per executor.
To get the number of executors,
sc.getConf.getInt(“spark.executor.instances”,-1)
Cheers,
Doug
On Jul 20, 2015, at 5:04 AM, Daniel Haviv
daniel.ha...@veracity-group.com wrote:
Hi,
My data
to be aware of that as you decide when to call
coalesce.
Thanks,
Silvio
From: Daniel Haviv
Date: Monday, July 20, 2015 at 4:59 PM
To: Doug Balog
Cc: user
Subject: Re: Local Repartition
Thanks Doug,
coalesce might invoke a shuffle as well.
I don't think what I'm suggesting
I will
Thank you.
> On 27 Oct 2015, at 4:54, Felix Cheung wrote:
>
> Please open a JIRA?
>
>
> Date: Mon, 26 Oct 2015 15:32:42 +0200
> Subject: HiveContext ignores ("skip.header.line.count"="1")
> From: daniel.ha...@veracity-group.com
> To: user@spark.apache.org
Hi,
I have a csv table in Hive which is configured to skip the header row using
TBLPROPERTIES("skip.header.line.count"="1").
When querying from Hive the header row is not included in the data, but
when running the same query via HiveContext I get the header row.
I made sure that HiveContext sees
microreflection,
dsParamLine['unerroreds'] unerroreds,
dsParamLine['corrected'] corrected,
dsParamLine['uncorrectables'] uncorrectables,
from_unixtime(cast(datats/1000 as bigint),'MMdd')
day_ts,
cmtsid
""")
On Thu,
indow on the Dstream) - and it works as expected.
>
> Check this out for code example:
> https://github.com/databricks/reference-apps/blob/master/logs_analyzer/chapter1/scala/src/main/scala/com/databricks/apps/logs/chapter1/LogAnalyzerStreamingSQL.scala
>
> -adrian
>
> From: Daniel Havi
Hi,
As things that run inside foreachRDD run at the driver, does that mean that
if we use SQLContext inside foreachRDD the data is sent back to the driver
and only then the query is executed, or is it executed at the executors?
Thank you.
Daniel
out soon.
>
>
>
> Hao
>
>
>
> *From:* Daniel Haviv [mailto:daniel.ha...@veracity-group.com]
> *Sent:* Friday, October 9, 2015 3:08 AM
> *To:* user
> *Subject:* Re: Insert via HiveContext is slow
>
>
>
> Forgot to mention that my insert is a multi table insert
Hi,
We are inserting streaming data into a hive orc table via a simple insert
statement passed to HiveContext.
When trying to read the files generated using Hive 1.2.1 we are getting NPE:
at
org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.processRow(MapRecordSource.java:91)
at
Hi,
I'm inserting into a partitioned ORC table using an insert sql statement
passed via HiveContext.
The performance I'm getting is pretty bad and I was wondering if there are
ways to speed things up.
Would saving the DF like this
...@hortonworks.com
wrote:
On Thu, Jul 2, 2015 at 7:38 AM, Daniel Haviv
daniel.ha...@veracity-group.com wrote:
Hi,
I'm trying to start the thrift-server and passing it azure's blob
storage jars but I'm failing on:
Caused by: java.io.IOException: No FileSystem for scheme: wasb
Hi,
I'm running Spark 1.4 on Azure.
DataFrame's insertInto fails, but saveAsTable works.
It seems like some issue with accessing Azure's blob storage, but that
doesn't explain why one type of write works and the other doesn't.
This is the stack trace:
Caused by:
eDirectStream[AvroKey[GenericRecord],
> NullWritable, AvroKeyInputFormat[GenericRecord]](..)
>
> val avroData = avroStream.map(x => x._1.datum().toString)
>
>
> Thanks
> Best Regards
>
>> On Thu, Sep 3, 2015 at 6:17 PM, Daniel Haviv
>> <daniel.ha...@verac
Hi,
I'm reading messages from Kafka where the value is an avro file.
I would like to parse the contents of the message and work with it as a
DataFrame, like with the spark-avro package, but instead of files, pass it an
RDD.
How can this be achieved?
Thank you.
Daniel
Hi,
Are there any free JDBC drivers for thrift?
The only ones I could find are Simba's, which require a license.
Thanks,
Daniel
l work outside of EMR, but it's worth a try.
>
> I've also used the ODBC driver from Hortonworks
> <http://hortonworks.com/products/releases/hdp-2-2/#add_ons>.
>
> Regards,
> Dan
>
> On Wed, Sep 16, 2015 at 8:34 AM, Daniel Haviv <
> daniel.ha...@veracity-group.com> w
Hi,
I have a DStream which is a stream of RDD[String].
How can I pass a DStream to sqlContext.jsonRDD and work with it as a DF ?
Thank you.
Daniel
Hi,
I'm trying to use KafkaUtils.createDirectStream to read avro messages from
Kafka but something is off with my type arguments:
val avroStream = KafkaUtils.createDirectStream[AvroKey[GenericRecord],
GenericRecord, NullWritable, AvroInputFormat[GenericRecord]](ssc,
kafkaParams, topicSet)
I'm
I tried but I'm getting the same error (task not serializable)
> On 25 Sep 2015, at 20:10, Ted Yu <yuzhih...@gmail.com> wrote:
>
> Is the Schema.parse() call expensive ?
>
> Can you call it in the closure ?
>
>> On Fri, Sep 25, 2015 at 10:06 AM, Daniel Ha
Hi,
I'm getting a NotSerializableException even though I'm creating all of my
objects from within the closure:
import org.apache.avro.generic.GenericDatumReader
import java.io.File
import org.apache.avro._
val orig_schema = Schema.parse(new File("/home/wasabi/schema"))
val READER = new
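In the spark-shell, top-level vals like `orig_schema` and `READER` get captured by task closures, and Avro's `Schema` and readers are not serializable. One standard workaround, sketched here under assumptions (an `avroBytes: RDD[Array[Byte]]` of Avro-encoded messages, and the schema file present on every worker at the same path), is to build them inside `mapPartitions` so nothing from the driver is shipped:

```scala
import java.io.File
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericDatumReader, GenericRecord}
import org.apache.avro.io.DecoderFactory

// Build the non-serializable Schema and reader *inside* the closure, once
// per partition; the task then serializes nothing from the driver.
val parsed = avroBytes.mapPartitions { records =>
  val schema = new Schema.Parser().parse(new File("/home/wasabi/schema"))
  val reader = new GenericDatumReader[GenericRecord](schema)
  records.map { bytes =>
    val decoder = DecoderFactory.get().binaryDecoder(bytes, null)
    reader.read(null, decoder).toString
  }
}
```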
tty inefficient on almost any file system
>
> El martes, 22 de septiembre de 2015, Daniel Haviv
> <daniel.ha...@veracity-group.com> escribió:
>> Hi,
>> We are trying to load around 10k avro files (each file holds only one
>> record) using spark-avro but it take
Hi,
We are trying to load around 10k avro files (each file holds only one
record) using spark-avro but it takes over 15 minutes to load.
It seems that most of the work is being done at the driver, where it creates
a broadcast variable for each file.
Any idea why it's behaving that way?
Thank
Hi,
I'm loading 1,000 files using the spark-avro package:
val df = sqlContext.read.avro(*"/incoming/"*)
When I'm performing an action on this df it seems like for each file a
broadcast is being created and is sent to the workers (instead of the
workers reading their data-local files):
scala>
Hi,
I have an application running a set of transformations and finishes
with saveAsTextFile.
Out of 80 tasks all finish pretty fast but one, which just hangs and
outputs these messages to STDERR:
5/12/17 17:22:19 INFO collection.ExternalAppendOnlyMap: Thread 82
spilling in-memory map of 4.0 GB to
Hi,
I'm trying to create a table on s3a but I keep hitting the following error:
Exception in thread "main" org.apache.hadoop.hive.ql.metadata.HiveException:
MetaException(message:com.cloudera.com.amazonaws.AmazonClientException: Unable
to load AWS credentials from any provider in the chain)
I
parameter to switch broadcast
> method.
>
> Can you describe the issues with torrent broadcast in more detail ?
>
> Which version of Spark are you using ?
>
> Thanks
>
> On Wed, Jun 1, 2016 at 7:48 AM, Daniel Haviv <
> daniel.ha...@veracity-group.com> wrote:
>
>
Hi,
I'm wrapped the following code into a jar:
val test = sc.parallelize(Seq(("daniel", "a"), ("daniel", "b"), ("test", "1)")))
val agg = test.groupByKey()
agg.collect.foreach(r=>{println(r._1)})
The result of groupByKey is an empty RDD, when I'm trying the same
code using the spark-shell it's
I'm using EC2 instances
Thank you.
Daniel
> On 9 Jun 2016, at 16:49, Gourav Sengupta <gourav.sengu...@gmail.com> wrote:
>
> Hi,
>
> are you using EC2 instances or local cluster behind firewall.
>
>
> Regards,
> Gourav Sengupta
>
>> On
Hi,
I've set these properties both in core-site.xml and hdfs-site.xml with no luck.
Thank you.
Daniel
> On 9 Jun 2016, at 01:11, Steve Loughran <ste...@hortonworks.com> wrote:
>
>
>> On 8 Jun 2016, at 16:34, Daniel Haviv <daniel.ha...@veracity-group.com>
>>
Hi,
Our application is failing due to issues with the torrent broadcast, is
there a way to switch to another broadcast method?
Thank you.
Daniel
> Anyway, I couldn't find a root cause only from the stacktraces...
>
> // maropu
>
>
>
>
> On Mon, Jun 6, 2016 at 2:14 AM, Daniel Haviv <
> daniel.ha...@veracity-group.com> wrote:
>
>> Hi,
>> I've set spark.broadcast.factory to
>> org.ap
>
> On Sun, Jun 19, 2016 at 7:10 AM, Takeshi Yamamuro <linguin@gmail.com>
> wrote:
>
>> How about using `transient` annotations?
>>
>> // maropu
>>
>> On Sun, Jun 19, 2016 at 10:51 PM, Daniel Haviv <
>> daniel.ha...@veracity-group.
Hi,
I want to use an accumulableCollection of type mutable.ArrayBuffer[String]
to return invalid records detected during transformations, but I don't
quite get its type:
val errorAccumulator: Accumulable[ArrayBuffer[String], String] =
sc.accumulableCollection(mutable.ArrayBuffer[String]())
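The two type parameters come from `Accumulable[R, T]`: `R` is the accumulated collection (`ArrayBuffer[String]`) and `T` is the element type added with `+=` (`String`). A hedged usage sketch, assuming some `records: RDD[String]` and a trivial validity check:

```scala
import scala.collection.mutable

// Accumulable[R, T]: R = the accumulated collection, T = the element type.
val errorAccumulator = sc.accumulableCollection(mutable.ArrayBuffer[String]())

// Flag empty lines as invalid via +=, pass everything on transformed.
val cleaned = records.map { rec =>
  if (rec.trim.isEmpty) errorAccumulator += rec
  rec.toUpperCase
}
cleaned.count()   // accumulators only update once an action runs
println(s"invalid records: ${errorAccumulator.value.size}")
```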
Hi,
I'm trying to use the Spark Sink with Flume but it seems I'm missing some
of the dependencies.
I'm running the following code:
./bin/spark-shell --master yarn --jars
Hi,
shouldn't groupByKey be avoided (
https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/best_practices/prefer_reducebykey_over_groupbykey.html)
?
Thank you,
Daniel
On Wed, Mar 30, 2016 at 9:01 AM, Akhil Das
wrote:
> Isn't it what
Hi,
We have a hive table which gets data written to it by two partition keys,
day and hour.
We would like to stream the incoming files, and since fileStream can only
listen on one directory, we start a streaming job on the latest partition
and every hour kill it and start a new one on a newer
lugin to work? I have CDH 5.5.2 installed which
> comes with HBase 1.0.0 and Phoenix 4.5.2. Do you think this will work?
>
> Thanks,
> Ben
>
>> On Apr 24, 2016, at 1:43 AM, Daniel Haviv <daniel.ha...@veracity-group.com>
>> wrote:
>>
>> Hi,
>>
Hi,
I tried saving DF to HBase using a hive table with hbase storage handler and
hiveContext but it failed due to a bug.
I was able to persist the DF to hbase using Apache Phoenix, which was pretty
simple.
Thank you.
Daniel
> On 21 Apr 2016, at 16:52, Benjamin Kim wrote:
>
Hi,
I'm running a very simple job (textFile->map->groupBy->count) with Spark
1.6.0 on EMR 4.3 (Hadoop 2.7.1) and hitting this exception when running in
yarn-client mode but not in local mode:
16/05/11 10:29:26 INFO TaskSetManager: Starting task 0.1 in stage 0.0 (TID
1,
Hi,
I have a DStream I'd like to collect and broadcast its values.
To do so I've created a mutable HashMap which I'm filling with foreachRDD,
but when I'm checking it, it remains empty. If I use an ArrayBuffer it works
as expected.
This is my code:
val arr =
eption. Also, you can’t update the
> value of a broadcast variable, since it’s immutable.
>
> Thanks,
> Silvio
>
> From: Daniel Haviv <daniel.ha...@veracity-group.com>
> Date: Sunday, May 15, 2016 at 6:23 AM
> To: user <user@spark.apache.org>
> Subject: &qu
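A common pitfall in this situation (the exact cause depends on the truncated code): the body of `foreachRDD` runs on the driver, but closures applied to the RDD itself run on the executors, so a driver-side HashMap filled from `rdd.foreach` never changes on the driver. A hedged sketch, assuming a `dstream: DStream[(String, String)]`, that collects to the driver first:

```scala
import scala.collection.mutable

// rdd.foreach { ... } runs on executors; collect() brings the batch to the
// driver so the driver-local map can actually be updated there.
val state = mutable.HashMap[String, String]()
dstream.foreachRDD { rdd =>
  val batch = rdd.collect()                      // result lands on the driver
  batch.foreach { case (k, v) => state(k) = v }  // now a driver-local update
  // a new broadcast could be created here from state.toMap if needed downstream
}
```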
How come for the first() function it calculates an updated value but
for collect() it doesn't?
On Sun, May 8, 2016 at 4:17 PM, Ted Yu wrote:
> I don't think so.
> RDD is immutable.
>
> > On May 8, 2016, at 2:14 AM, Sisyphuss wrote:
> >
> >
Hi,
How can the DirectOutputCommitter be utilized for writing ORC files?
I tried setting it via:
sc.getConf.set("spark.hadoop.mapred.output.committer.class","com.veracity-group.datastage.directorcwriter")
But I can still see a _temporary directory being used when I save my
dataframe as ORC.
Hi,
I have a streaming application which uses a broadcast variable which I
populate from a database.
I would like every once in a while (or even every batch) to update/replace
the broadcast variable with the latest data from the database.
Only way I found online to do this is this "hackish" way (
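The widely-circulated workaround (not an official API) is to hold the broadcast in a mutable reference on the driver and swap it between batches. A hedged sketch; `loadFromDatabase()` is a hypothetical loader returning the lookup data:

```scala
// Driver-side mutable reference to the current broadcast.
var lookup = ssc.sparkContext.broadcast(loadFromDatabase()) // hypothetical loader

dstream.foreachRDD { rdd =>
  // refresh before processing this batch
  lookup.unpersist(blocking = false)
  lookup = rdd.sparkContext.broadcast(loadFromDatabase())
  val current = lookup                 // capture a stable reference for the closure
  rdd.foreach { rec =>
    val table = current.value          // executor-side access to the fresh data
    // ... process rec against table (application-specific) ...
  }
}
```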
Hi,
I'm running Spark using IntelliJ on Windows and even though I set
HADOOP_CONF_DIR it does not affect the contents of sc.hadoopConfiguration.
Anybody encountered it?
Thanks,
Daniel
Or use mapWithState
Thank you.
Daniel
> On 16 Jul 2016, at 03:02, RK Aduri wrote:
>
> You can probably define sliding windows and set larger batch intervals.
>
>
>
> --
> View this message in context:
>
Hi,
I'm using Spark 2.1.0 on Zeppelin.
I can successfully create a table but when I try to select from it I fail:
spark.sql("create table foo (name string)")
res0: org.apache.spark.sql.DataFrame = []
spark.sql("select * from foo")
org.apache.spark.sql.AnalysisException:
Hive support is required
Hi,
I'm trying to make a stateful stream of Tuple2[String, Dataset[Record]] :
val kafkaStream = KafkaUtils.createDirectStream[String, String,
StringDecoder, StringDecoder](ssc, kafkaParams, topicSet)
val stateStream: DStream[RDD[(String, Record)]] = kafkaStream.map(x=>
{
ue, Nov 8, 2016 at 7:46 PM, Daniel Haviv <
daniel.ha...@veracity-group.com> wrote:
> Hi,
> I'm trying to make a stateful stream of Tuple2[String, Dataset[Record]] :
>
> val kafkaStream = KafkaUtils.createDirectStream[String, String,
> StringDecoder, StringDecoder](ssc
Scratch that, it's working fine.
Thank you.
On Tue, Nov 8, 2016 at 8:19 PM, Daniel Haviv <
daniel.ha...@veracity-group.com> wrote:
> Hi,
> I should have used transform instead of map
>
> val x: DStream[(String, Record)] =
> kafkaStream.transform(x=>{x
Hi,
How can I utilize mapWithState and DataFrames?
Right now I stream json messages from Kafka, update their state, output the
updated state as json and compose a dataframe from it.
It seems inefficient both in terms of processing and storage (a long string
instead of a compact DF).
Is there a
Hi,
I have a stateful streaming app where I pass a rather large initialState
RDD at the beginning.
No matter how many partitions I divide the stateful stream into, I keep
failing on OOM or Java heap space errors.
Is there a way to make it more resilient?
How can I control its storage level?
This is
Hi,
I have a case where I use partitionBy to write my DF using a calculated
column, so it looks somethings like this:
val df = spark.sql("select *, from_unixtime(ts, 'MMddH')
partition_key from mytable")
df.write.partitionBy("partition_key").orc("/partitioned_table")
df is 8 partitions in
Hi,
I have a fairly simple stateful streaming job that suffers from high GC, and
its executors are killed as they are exceeding the size of the requested
container.
My current executor-memory is 10G, spark overhead is 2G and it's running
with one core.
At first the job begins running at a rate
Hi,
Is it possible to use mapWithState without checkpointing at all?
I'd rather have the whole application fail, restart, and reload an
initialState RDD than pay for checkpointing every 10 batches.
Thank you,
Daniel