Hi,
I'm trying to debug an issue with Spark so I've set log level to DEBUG but
at the same time I'd like to avoid the httpclient.wire's verbose output by
setting it to WARN.
I tried the following log4j.properties config but I'm still getting DEBUG
output for httpclient.wire:
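For what it's worth, the same override can be applied programmatically from the shell (a sketch, assuming the log4j 1.x API that Spark bundles):
import org.apache.log4j.{Level, Logger}
// keep the root logger at DEBUG but silence the wire logger
Logger.getRootLogger.setLevel(Level.DEBUG)
Logger.getLogger("httpclient.wire").setLevel(Level.WARN)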
Hi,
I would like to expose a DF through the Thrift server, but even though I
enable spark.sql.hive.thriftServer.singleSession I still can't see the temp
table.
I'm using Spark 2.2.0:
spark-shell --conf spark.sql.hive.thriftServer.singleSession=true
import
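For reference, the pattern I'm trying to get working is roughly this (untested sketch; the view name is a placeholder):
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2
val df = spark.range(10).toDF("id")
df.createOrReplaceTempView("my_temp_table")
// start the Thrift server inside the same session so the temp view is visible to JDBC clients
HiveThriftServer2.startWithContext(spark.sqlContext)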
Hi,
I'm trying to write a variant of date_add that accepts an interval as a
second parameter so that I could use the following syntax with SparkSQL:
select date_add(cast('1970-01-01' as date), interval 1 day)
but I'm getting the following error:
ValueError: (ValueError(u'Could not parse datatype:
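As a fallback I'm considering a plainer variant that takes the number of days instead of an interval literal (a sketch; the function name is made up):
import java.sql.Date
spark.udf.register("date_add_days",
  (d: Date, days: Int) => Date.valueOf(d.toLocalDate.plusDays(days)))
spark.sql("select date_add_days(cast('1970-01-01' as date), 1)").show()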
Some code would help to debug the issue
On Fri, 8 Dec 2017 at 21:54 Afshin, Bardia <
bardia.afs...@changehealthcare.com> wrote:
> Using pyspark cli on spark 2.1.1 I’m getting out of memory issues when
> running the udf function on a recordset count of 10 with a mapping of the
> same value
Hi,
Is there a guide to writing a custom Structured Streaming receiver?
Thank you.
Daniel
Hi,
With RDDs it was possible to define a partitioner for two RDDs, and given
that the two RDDs had the same partitioner, a join operation would be
performed locally within each partition without shuffling.
Can Dataset joins be optimized in the same way?
Is it enough to repartition two Datasets on the
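Something like this is what I have in mind (a sketch; whether the planner actually drops the exchange may depend on bucketing, so treat it as an experiment):
import spark.implicits._
val left  = spark.range(1000000).toDF("k").repartition($"k")
val right = spark.range(1000000).toDF("k").repartition($"k")
// explain() should show whether a shuffle (Exchange) is still planned for the join
left.join(right, "k").explain()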
Hi,
I'm using Spark 2.1.0 on Zeppelin.
I can successfully create a table but when I try to select from it I fail:
spark.sql("create table foo (name string)")
res0: org.apache.spark.sql.DataFrame = []
spark.sql("select * from foo")
org.apache.spark.sql.AnalysisException:
Hive support is required
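If you control how the session is built, I assume something like this is what's missing (sketch):
val spark = org.apache.spark.sql.SparkSession.builder().enableHiveSupport().getOrCreate()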
There's no need to extend Spark's API, look at mapWithState for examples.
On Thu, Dec 8, 2016 at 4:49 AM, Anty Rao wrote:
>
>
> On Wed, Dec 7, 2016 at 7:42 PM, Anty Rao wrote:
>
>> Hi
>> I'm new to Spark. I'm doing some research to see if spark streaming
Hi Anty,
What you could do is keep only the existence of a key in the state and, when
necessary, pull the full value from a secondary store like HDFS or HBase.
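Roughly what I mean (untested sketch; keyedStream and the external lookup are hypothetical):
import org.apache.spark.streaming.{State, StateSpec}
// keep only "have we seen this key?" in Spark's state
def trackSeen(key: String, value: Option[String], state: State[Boolean]): (String, Boolean) = {
  val seenBefore = state.getOption().getOrElse(false)
  if (!seenBefore) state.update(true)
  (key, seenBefore)
}
val seenStream = keyedStream.mapWithState(StateSpec.function(trackSeen _))
// for keys already seen, fetch the full record from HDFS/HBase as needed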
Daniel
On Wed, Dec 7, 2016 at 1:42 PM, Anty Rao wrote:
> Hi
> I'm new to Spark. I'm doing some research to see if spark
Hi Samy,
If you're working with Hive you could create a partitioned table and update
its partitions' locations to point at the latest version, so when you query it
with Spark you'll always get the latest version.
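Roughly (sketch; table, partition and path are made up, and the DDL can equally be run in Hive directly):
spark.sql("""
  ALTER TABLE mytable PARTITION (dt='2016-11-17')
  SET LOCATION 'hdfs:///data/mytable/dt=2016-11-17/v42'
""")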
Daniel
On Thu, Nov 17, 2016 at 9:05 PM, Samy Dindane wrote:
> Hi,
>
>
Hi,
Is it possible to use mapWithState without checkpointing at all ?
I'd rather have the whole application fail, restart and reload an
initialState RDD than pay for checkpointing every 10 batches.
Thank you,
Daniel
Hi,
I have a fairly simple stateful streaming job that suffers from high GC and
its executors are killed for exceeding the size of the requested
container.
My current executor-memory is 10G, spark overhead is 2G and it's running
with one core.
At first the job begins running at a rate
Scratch that, it's working fine.
Thank you.
On Tue, Nov 8, 2016 at 8:19 PM, Daniel Haviv <
daniel.ha...@veracity-group.com> wrote:
> Hi,
> I should have used transform instead of map
>
> val x: DStream[(String, Record)] =
> kafkaStream.transform(x=>{x
On Tue, Nov 8, 2016 at 7:46 PM, Daniel Haviv <
daniel.ha...@veracity-group.com> wrote:
> Hi,
> I'm trying to make a stateful stream of Tuple2[String, Dataset[Record]] :
>
> val kafkaStream = KafkaUtils.createDirectStream[String, String,
> StringDecoder, StringDecoder](ssc
Hi,
I'm trying to make a stateful stream of Tuple2[String, Dataset[Record]] :
val kafkaStream = KafkaUtils.createDirectStream[String, String,
StringDecoder, StringDecoder](ssc, kafkaParams, topicSet)
val stateStream: DStream[RDD[(String, Record)]] = kafkaStream.map(x=>
{
Hi,
I have a stateful streaming app where I pass a rather large initialState
RDD at the beginning.
No matter how many partitions I divide the stateful stream into, I keep
failing with OOM / Java heap space errors.
Is there a way to make it more resilient?
How can I control its storage level?
This is
Hi,
How can I utilize mapWithState and DataFrames?
Right now I stream JSON messages from Kafka, update their state, output the
updated state as JSON and compose a DataFrame from it.
It seems inefficient both in terms of processing and storage (a long string
instead of a compact DF).
Is there a
Hi Baki,
It's enough for the producer to write the messages compressed. See here:
https://cwiki.apache.org/confluence/display/KAFKA/Compression
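For illustration, a producer-side sketch (broker address and topic are placeholders):
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
val props = new Properties()
props.put("bootstrap.servers", "broker:9092")
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("compression.type", "gzip")   // consumers decompress transparently
val producer = new KafkaProducer[String, String](props)
producer.send(new ProducerRecord("mytopic", "key", "value"))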
Thank you.
Daniel
> On 3 Nov 2016, at 21:27, Daniel Haviv <daniel.ha...@veracity-group.com> wrote:
>
> Hi,
> Kafka can compre
Hi,
Kafka can compress/decompress your messages for you seamlessly; adding
compression on top of that would be redundant.
Thank you.
Daniel
> On 3 Nov 2016, at 20:53, bhayat wrote:
>
> Hello,
>
> I really wonder that whether i can stream compressed data with using
>
Hi,
I have the following scenario:
scala> val df = spark.sql("select * from danieltest3")
df: org.apache.spark.sql.DataFrame = [iid: string, activity: string ... 34
more fields]
Now I'm trying to map through the rows I'm getting:
scala> df.map(r=>r.toSeq)
:32: error: Unable to find encoder for
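Two workarounds I'm aware of (a sketch; there is no Encoder for Seq[Any], so the result either stays an RDD or must be mapped to an encodable type):
// drop to the RDD API, where no Encoder is needed
val asRdd = df.rdd.map(r => r.toSeq)
// or map to a type Spark can encode
import spark.implicits._
val asStrings = df.map(r => r.mkString("|"))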
Hi,
I have a case where I use partitionBy to write my DF using a calculated
column, so it looks somethings like this:
val df = spark.sql("select *, from_unixtime(ts, 'MMddH')
partition_key from mytable")
df.write.partitionBy("partition_key").orc("/partitioned_table")
df is 8 partitions in
Hi,
How can the DirectOutputCommitter be utilized for writing ORC files?
I tried setting it via:
sc.getConf.set("spark.hadoop.mapred.output.committer.class","com.veracity-group.datastage.directorcwriter")
But I can still see a _temporary directory being used when I save my
dataframe as ORC.
Hi,
I'm running Spark using IntelliJ on Windows and even though I set
HADOOP_CONF_DIR it does not affect the contents of sc.hadoopConfiguration.
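A possible workaround (sketch; the paths are placeholders) is to load the config files explicitly:
import org.apache.hadoop.fs.Path
sc.hadoopConfiguration.addResource(new Path("C:/hadoop/conf/core-site.xml"))
sc.hadoopConfiguration.addResource(new Path("C:/hadoop/conf/hdfs-site.xml"))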
Anybody encountered it ?
Thanks,
Daniel
Or use mapWithState
Thank you.
Daniel
> On 16 Jul 2016, at 03:02, RK Aduri wrote:
>
> You can probably define sliding windows and set larger batch intervals.
>
>
>
Hi,
I have a streaming application which uses a broadcast variable which I
populate from a database.
I would like every once in a while (or even every batch) to update/replace
the broadcast variable with the latest data from the database.
Only way I found online to do this is this "hackish" way (
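The pattern I keep seeing suggested looks roughly like this (untested sketch; loadFromDatabase and lookup are hypothetical):
var refData = ssc.sparkContext.broadcast(loadFromDatabase())
dstream.foreachRDD { rdd =>
  // drop the old broadcast and re-create it with fresh data for this batch
  refData.unpersist(blocking = false)
  refData = rdd.sparkContext.broadcast(loadFromDatabase())
  val localRef = refData
  rdd.foreach(record => lookup(localRef.value, record))
}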
Hi,
I want to use an accumulableCollection of type mutable.ArrayBuffer[String]
to return invalid records detected during transformations, but I don't
quite get its type:
val errorAccumulator: Accumulable[ArrayBuffer[String], String] =
sc.accumulableCollection(mutable.ArrayBuffer[String]())
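The way I intend to use it is roughly (sketch; rawRdd and the validation rule are made up):
val parsed = rawRdd.map { line =>
  if (!line.contains(",")) { errorAccumulator += line; None }
  else Some(line)
}
parsed.count()                         // force evaluation so the accumulator is populated
println(errorAccumulator.value.size)   // read the collected invalid records on the driver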
>
> On Sun, Jun 19, 2016 at 7:10 AM, Takeshi Yamamuro <linguin@gmail.com>
> wrote:
>
>> How about using `transient` annotations?
>>
>> // maropu
>>
>> On Sun, Jun 19, 2016 at 10:51 PM, Daniel Haviv <
>> daniel.ha...@veracity-group.
> Anyway, I couldn't find a root cause only from the stacktraces...
>
> // maropu
>
>
>
>
> On Mon, Jun 6, 2016 at 2:14 AM, Daniel Haviv <
> daniel.ha...@veracity-group.com> wrote:
>
>> Hi,
>> I've set spark.broadcast.factory to
>> org.ap
I'm using EC2 instances
Thank you.
Daniel
> On 9 Jun 2016, at 16:49, Gourav Sengupta <gourav.sengu...@gmail.com> wrote:
>
> Hi,
>
> are you using EC2 instances or local cluster behind firewall.
>
>
> Regards,
> Gourav Sengupta
>
>> On
Hi,
I've set these properties both in core-site.xml and hdfs-site.xml with no luck.
Thank you.
Daniel
> On 9 Jun 2016, at 01:11, Steve Loughran <ste...@hortonworks.com> wrote:
>
>
>> On 8 Jun 2016, at 16:34, Daniel Haviv <daniel.ha...@veracity-group.com>
>>
Hi,
I'm trying to create a table on s3a but I keep hitting the following error:
Exception in thread "main" org.apache.hadoop.hive.ql.metadata.HiveException:
MetaException(message:com.cloudera.com.amazonaws.AmazonClientException: Unable
to load AWS credentials from any provider in the chain)
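What I'd expect to be needed, going by the Hadoop s3a property names (sketch; values are placeholders):
sc.hadoopConfiguration.set("fs.s3a.access.key", "<access key>")
sc.hadoopConfiguration.set("fs.s3a.secret.key", "<secret key>")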
I
Hi,
I'm wrapped the following code into a jar:
val test = sc.parallelize(Seq(("daniel", "a"), ("daniel", "b"), ("test", "1)")))
val agg = test.groupByKey()
agg.collect.foreach(r=>{println(r._1)})
The result of groupByKey is an empty RDD, when I'm trying the same
code using the spark-shell it's
parameter to switch broadcast
> method.
>
> Can you describe the issues with torrent broadcast in more detail ?
>
> Which version of Spark are you using ?
>
> Thanks
>
> On Wed, Jun 1, 2016 at 7:48 AM, Daniel Haviv <
> daniel.ha...@veracity-group.com> wrote:
>
>
Hi,
Our application is failing due to issues with the torrent broadcast. Is
there a way to switch to another broadcast method?
Thank you.
Daniel
eption. Also, you can’t update the
> value of a broadcast variable, since it’s immutable.
>
> Thanks,
> Silvio
>
> From: Daniel Haviv <daniel.ha...@veracity-group.com>
> Date: Sunday, May 15, 2016 at 6:23 AM
> To: user <user@spark.apache.org>
> Subject: &qu
Hi,
I have a DStream whose values I'd like to collect and broadcast.
To do so I've created a mutable HashMap which I'm filling with foreachRDD,
but when I check it, it remains empty. If I use an ArrayBuffer it works
as expected.
This is my code:
val arr =
Hi,
I'm running a very simple job (textFile->map->groupby->count) with Spark
1.6.0 on EMR 4.3 (Hadoop 2.7.1) and hitting this exception when running on
yarn-client but not in local mode:
16/05/11 10:29:26 INFO TaskSetManager: Starting task 0.1 in stage 0.0 (TID
1,
How come that for the first() function it calculates an updated value and
for collect it doesn't ?
On Sun, May 8, 2016 at 4:17 PM, Ted Yu wrote:
> I don't think so.
> RDD is immutable.
>
> > On May 8, 2016, at 2:14 AM, Sisyphuss wrote:
> >
> >
lugin to work? I have CDH 5.5.2 installed which
> comes with HBase 1.0.0 and Phoenix 4.5.2. Do you think this will work?
>
> Thanks,
> Ben
>
>> On Apr 24, 2016, at 1:43 AM, Daniel Haviv <daniel.ha...@veracity-group.com>
>> wrote:
>>
>> Hi,
>>
Hi,
I tried saving a DF to HBase using a Hive table with the HBase storage handler
and HiveContext, but it failed due to a bug.
I was able to persist the DF to HBase using Apache Phoenix, which was pretty
simple.
Thank you.
Daniel
> On 21 Apr 2016, at 16:52, Benjamin Kim wrote:
>
Hi,
We have a hive table which gets data written to it by two partition keys,
day and hour.
We would like to stream the incoming files, and since fileStream can only
listen on one directory, we start a streaming job on the latest partition
and every hour kill it and start a new one on a newer
Hi,
shouldn't groupByKey be avoided (
https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/best_practices/prefer_reducebykey_over_groupbykey.html)
?
Thank you,
Daniel
On Wed, Mar 30, 2016 at 9:01 AM, Akhil Das
wrote:
> Isn't it what
Hi,
I'm trying to use the Spark Sink with Flume but it seems I'm missing some
of the dependencies.
I'm running the following code:
./bin/spark-shell --master yarn --jars
Hi,
I have an application that runs a set of transformations and finishes
with saveAsTextFile.
Out of 80 tasks, all finish pretty fast except one that just hangs and
outputs these messages to STDERR:
15/12/17 17:22:19 INFO collection.ExternalAppendOnlyMap: Thread 82
spilling in-memory map of 4.0 GB to
I will
Thank you.
> On 27 Oct 2015, at 4:54, Felix Cheung wrote:
>
> Please open a JIRA?
>
>
> Date: Mon, 26 Oct 2015 15:32:42 +0200
> Subject: HiveContext ignores ("skip.header.line.count"="1")
> From: daniel.ha...@veracity-group.com
> To: user@spark.apache.org
Hi,
I have a csv table in Hive which is configured to skip the header row using
TBLPROPERTIES("skip.header.line.count"="1").
When querying from Hive the header row is not included in the data, but
when running the same query via HiveContext I get the header row.
I made sure that HiveContext sees
Hi,
We are inserting streaming data into a Hive ORC table via a simple INSERT
statement passed to HiveContext.
When trying to read the files generated using Hive 1.2.1 we are getting NPE:
at
org.apache.hadoop.hive.ql.exec.tez.MapRecordSource.processRow(MapRecordSource.java:91)
at
indow on the Dstream) - and it works as expected.
>
> Check this out for code example:
> https://github.com/databricks/reference-apps/blob/master/logs_analyzer/chapter1/scala/src/main/scala/com/databricks/apps/logs/chapter1/LogAnalyzerStreamingSQL.scala
>
> -adrian
>
> From: Daniel Havi
Hi,
Since code inside foreachRDD runs at the driver, does that mean that if we
use SQLContext inside foreachRDD the data is sent back to the driver and only
then is the query executed, or is it executed at the executors?
Thank you.
Daniel
out soon.
>
>
>
> Hao
>
>
>
> *From:* Daniel Haviv [mailto:daniel.ha...@veracity-group.com]
> *Sent:* Friday, October 9, 2015 3:08 AM
> *To:* user
> *Subject:* Re: Insert via HiveContext is slow
>
>
>
> Forgot to mention that my insert is a multi table insert
microreflection,
dsParamLine['unerroreds'] unerroreds,
dsParamLine['corrected'] corrected,
dsParamLine['uncorrectables'] uncorrectables,
from_unixtime(cast(datats/1000 as bigint),'MMdd')
day_ts,
cmtsid
""")
On Thu,
Hi,
I'm inserting into a partitioned ORC table using an insert sql statement
passed via HiveContext.
The performance I'm getting is pretty bad and I was wondering if there are
ways to speed things up.
Would saving the DF like this
Hi,
I have a DStream which is a stream of RDD[String].
How can I pass a DStream to sqlContext.jsonRDD and work with it as a DF ?
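The shape I have in mind is roughly (untested sketch):
dstream.foreachRDD { rdd =>
  if (!rdd.isEmpty()) {
    val df = sqlContext.read.json(rdd)   // or sqlContext.jsonRDD(rdd) on older releases
    df.registerTempTable("current_batch")
  }
}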
Thank you.
Daniel
I tried but I'm getting the same error (task not serializable)
> On 25 Sep 2015, at 20:10, Ted Yu <yuzhih...@gmail.com> wrote:
>
> Is the Schema.parse() call expensive ?
>
> Can you call it in the closure ?
>
>> On Fri, Sep 25, 2015 at 10:06 AM, Daniel Ha
Hi,
I'm getting a NotSerializableException even though I'm creating all of my
objects from within the closure:
import org.apache.avro.generic.GenericDatumReader
import java.io.File
import org.apache.avro._
val orig_schema = Schema.parse(new File("/home/wasabi/schema"))
val READER = new
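The usual fix I've seen for this kind of thing (a sketch; rawBytes is a hypothetical RDD[Array[Byte]], and it assumes the schema file exists on every worker):
import java.io.File
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericDatumReader, GenericRecord}
import org.apache.avro.io.DecoderFactory
val parsed = rawBytes.mapPartitions { iter =>
  // build the non-serializable reader on the executor, once per partition
  val schema = new Schema.Parser().parse(new File("/home/wasabi/schema"))
  val reader = new GenericDatumReader[GenericRecord](schema)
  iter.map { bytes =>
    val decoder = DecoderFactory.get().binaryDecoder(bytes, null)
    reader.read(null, decoder).toString
  }
}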
Hi,
I'm trying to use KafkaUtils.createDirectStream to read avro messages from
Kafka but something is off with my type arguments:
val avroStream = KafkaUtils.createDirectStream[AvroKey[GenericRecord],
GenericRecord, NullWritable, AvroInputFormat[GenericRecord]](ssc,
kafkaParams, topicSet)
I'm
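For comparison, my understanding of the 0.8 direct stream signature is key/value types plus their Kafka decoders (sketch):
import kafka.serializer.{DefaultDecoder, StringDecoder}
import org.apache.spark.streaming.kafka.KafkaUtils
val rawStream = KafkaUtils.createDirectStream[String, Array[Byte], StringDecoder, DefaultDecoder](
  ssc, kafkaParams, topicSet)
// the Array[Byte] values can then be decoded with Avro's GenericDatumReader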
tty inefficient on almost any file system
>
> El martes, 22 de septiembre de 2015, Daniel Haviv
> <daniel.ha...@veracity-group.com> escribió:
>> Hi,
>> We are trying to load around 10k avro files (each file holds only one
>> record) using spark-avro but it take
Hi,
We are trying to load around 10k Avro files (each file holds only one
record) using spark-avro, but it takes over 15 minutes to load.
It seems that most of the work is being done at the driver, where it creates
a broadcast variable for each file.
Any idea why it behaves that way?
Thank
Hi,
I'm loading 1000 files using the spark-avro package:
val df = sqlContext.read.avro("/incoming/")
When I perform an action on this df it seems like a broadcast is created for
each file and sent to the workers (instead of the workers reading their
data-local files):
scala>
l work outside of EMR, but it's worth a try.
>
> I've also used the ODBC driver from Hortonworks
> <http://hortonworks.com/products/releases/hdp-2-2/#add_ons>.
>
> Regards,
> Dan
>
> On Wed, Sep 16, 2015 at 8:34 AM, Daniel Haviv <
> daniel.ha...@veracity-group.com> w
Hi,
Are there any free JDBC drivers for Thrift?
The only ones I could find are Simba's, which require a license.
Thanks,
Daniel
eDirectStream[AvroKey[GenericRecord],
> NullWritable, AvroKeyInputFormat[GenericRecord]](..)
>
> val avroData = avroStream.map(x => x._1.datum().toString)
>
>
> Thanks
> Best Regards
>
>> On Thu, Sep 3, 2015 at 6:17 PM, Daniel Haviv
>> <daniel.ha...@verac
Hi,
I'm reading messages from Kafka where the value is an Avro file.
I would like to parse the contents of the message and work with it as a
DataFrame, like with the spark-avro package, but instead of files, pass it
an RDD.
How can this be achieved ?
Thank you.
Daniel
Hi,
I'd like to start a service with each Spark executor upon initialization and
have the distributed code reference that service locally.
What I'm trying to do is invoke torch7 computations without reloading
the model for each row, by starting a Lua HTTP handler that will receive HTTP
requests for
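The pattern I'm considering is a lazily initialized singleton, created at most once per executor JVM (sketch; startLocalHandler and score are hypothetical):
object TorchService {
  // initialized on first use, once per executor JVM
  lazy val endpoint: String = { startLocalHandler(); "http://localhost:8888" }
}
rdd.mapPartitions { rows =>
  val url = TorchService.endpoint
  rows.map(row => score(url, row))
}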
) :: (2,world)
::Nil).toDF.cache().registerTempTable(t)
HiveThriftServer2.startWithContext(sqlContext)
}
Again, I'm not really clear what your use case is, but it does sound like
the first link above is what you may want.
-Todd
On Wed, Aug 5, 2015 at 1:57 PM, Daniel Haviv
daniel.ha
Hi,
Is it possible to start the Spark SQL Thrift server from within a streaming app
so the streamed data can be queried as it comes in?
Thank you.
Daniel
Hi,
My data is constructed from a lot of small files which results in a lot of
partitions per RDD.
Is there some way to locally repartition the RDD without shuffling so that
all of the partitions that reside on a specific node will become X
partitions on the same node ?
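The closest thing I've found is coalesce with shuffle disabled, which only merges existing partitions rather than redistributing them (sketch; the target count is arbitrary):
val merged = rdd.coalesce(100, shuffle = false)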
Thank you.
Daniel
executors * 10, but I’m still
trying to figure out the
optimal number of partitions per executor.
To get the number of executors,
sc.getConf.getInt("spark.executor.instances", -1)
Cheers,
Doug
On Jul 20, 2015, at 5:04 AM, Daniel Haviv
daniel.ha...@veracity-group.com wrote:
Hi,
My data
to be aware of that as you decide when to call
coalesce.
Thanks,
Silvio
From: Daniel Haviv
Date: Monday, July 20, 2015 at 4:59 PM
To: Doug Balog
Cc: user
Subject: Re: Local Repartition
Thanks Doug,
coalesce might invoke a shuffle as well.
I don't think what I'm suggesting
Hi,
I'm running Spark 1.4 on Azure.
DataFrame's insertInto fails, but saveAsTable works.
It seems like some issue with accessing Azure's blob storage, but that
doesn't explain why one type of write works and the other doesn't.
This is the stack trace:
Caused by:
...@hortonworks.com
wrote:
On Thu, Jul 2, 2015 at 7:38 AM, Daniel Haviv
daniel.ha...@veracity-group.com wrote:
Hi,
I'm trying to start the thrift-server and passing it azure's blob
storage jars but I'm failing on :
Caused by: java.io.IOException: No FileSystem for scheme: wasb
On 3 Jul 2015, at 08:33, Daniel Haviv daniel.ha...@veracity-group.com
wrote:
Thanks
I was looking for a less hack-ish way :)
Daniel
On Fri, Jul 3, 2015 at 10:15 AM, Akhil Das ak...@sigmoidanalytics.com
wrote:
With binary i think it might not be possible, although if you can download
/apache/spark/blob/master/repl/scala-2.10/src/main/scala/org/apache/spark/repl/SparkILoop.scala#L1023
which initializes the SQLContext.
Thanks
Best Regards
On Thu, Jul 2, 2015 at 6:11 PM, Daniel Haviv
daniel.ha...@veracity-group.com wrote:
Hi,
I've downloaded the pre-built binaries
Hi,
I've downloaded the pre-built binaries for Hadoop 2.6, and whenever I start
the spark-shell it always starts with a HiveContext.
How can I prevent the HiveContext from being initialized automatically?
Thanks,
Daniel
Hi,
I'm trying to start the thrift-server and passing it azure's blob storage
jars but I'm failing on :
Caused by: java.io.IOException: No FileSystem for scheme: wasb
at
org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2584)
at
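My assumption is that the hadoop-azure jar isn't being picked up; as an experiment I plan to register the filesystem class explicitly (untested sketch):
sc.hadoopConfiguration.set("fs.wasb.impl", "org.apache.hadoop.fs.azure.NativeAzureFileSystem")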
at 7:38 AM, Daniel Haviv
daniel.ha...@veracity-group.com wrote:
Hi,
I'm trying to start the thrift-server and passing it azure's blob storage
jars but I'm failing on :
Caused by: java.io.IOException: No FileSystem for scheme: wasb
at
org.apache.hadoop.fs.FileSystem.getFileSystemClass
by step guide in how to setup and use Spark in
HDInsight.
https://azure.microsoft.com/en-us/documentation/articles/hdinsight-hadoop-spark-install/
Jacob
From: Daniel Haviv [mailto:daniel.ha...@veracity-group.com]
Sent: Thursday, June 25, 2015 3:19 PM
To: Silvio Fiorito
Cc: user
Hi,
I'm trying to use spark over Azure's HDInsight but the spark-shell fails
when starting:
java.io.IOException: No FileSystem for scheme: wasb
at
org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2584)
at
library is up
at http://spark-packages.org/package/granturing/spark-power-bi the Event Hubs
library should be up soon.
Thanks,
Silvio
From: Daniel Haviv
Date: Thursday, June 25, 2015 at 1:37 PM
To: user@spark.apache.org
Subject: Using Spark on Azure Blob Storage
Hi,
I'm trying to use
Hi,
Is it possible to store RDDs in custom output formats, for example ORC?
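One route I can think of (sketch; requires Hive support for ORC, and the case class and rdd are placeholders) is to go through a DataFrame:
import sqlContext.implicits._
case class Record(id: Int, name: String)
rdd.map { case (id, name) => Record(id, name) }
   .toDF()
   .write.format("orc").save("/tmp/orc_out")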
Thanks,
Daniel
Hi,
We wrote a Spark Streaming app that receives file names on HDFS from Kafka
and opens them using Hadoop's libraries.
The problem with this method is that I'm not utilizing data locality,
because any worker might open any file regardless of where its blocks
reside.
I can't open the files
Hi,
We are designing a solution which pulls file paths from Kafka and for the
current stage just counts the lines in each of these files.
When running the code it fails on:
Exception in thread main org.apache.spark.SparkException: Task not
serializable
at
Quoting Michael:
Predicate push down into the input format is turned off by default because
there is a bug in the current Parquet library that causes null pointer
exceptions when there are full row groups that are null.
https://issues.apache.org/jira/browse/SPARK-4258
You can turn it on if you want:
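Presumably via the pushdown flag (my reading of what was meant):
sqlContext.setConf("spark.sql.parquet.filterPushdown", "true")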
problem..
2014-12-09 22:58 GMT+08:00 Daniel Haviv danielru...@gmail.com:
Hi,
I've built spark 1.3 with hadoop 2.6 but when I startup the spark-shell I
get the following exception:
14/12/09 06:54:24 INFO server.AbstractConnector: Started
SelectChannelConnector@0.0.0.0:4040
14/12/09
Hi,
I've built spark successfully with maven but when I try to run spark-shell I
get the following errors:
Spark assembly has been built with Hive, including Datanucleus jars on classpath
Exception in thread main java.lang.NoClassDefFoundError:
org/apache/spark/deploy/SparkSubmit
Caused by:
On Tue, Dec 16, 2014 at 5:04 PM, Daniel Haviv danielru...@gmail.com
wrote:
Hi,
I've built spark successfully with maven but when I try to run
spark-shell I get the following errors:
Spark assembly has been built with Hive, including Datanucleus jars on
classpath
Exception in thread main
Best Regards
On Tue, Dec 16, 2014 at 9:33 PM, Daniel Haviv danielru...@gmail.com
wrote:
That's the first thing I tried... still the same error:
hdfs@ams-rsrv01:~$ export CLASSPATH=/tmp/spark/spark-branch-1.1/lib
hdfs@ams-rsrv01:~$ cd /tmp/spark/spark-branch-1.1
hdfs@ams-rsrv01:/tmp/spark
/datanucleus-core-3.2.2.jar:/home/akhld/mobi/localcluster/spark-1/lib/datanucleus-rdbms-3.2.1.jar:/home/akhld/mobi/localcluster/spark-1/lib/datanucleus-api-jdo-3.2.1.jar
Thanks
Best Regards
On Tue, Dec 16, 2014 at 10:33 PM, Daniel Haviv danielru...@gmail.com
wrote:
Same here...
# jar tf lib
it and see what are the contents?
On 16 Dec 2014 22:46, Daniel Haviv danielru...@gmail.com wrote:
I've added every jar in the lib dir to my classpath and still no luck:
CLASSPATH=/tmp/spark/spark-branch-1.1/lib/datanucleus-api-jdo-3.2.1.jar:/tmp/spark/spark-branch-1.1/lib/datanucleus-core
are the contents?
On 16 Dec 2014 22:46, Daniel Haviv danielru...@gmail.com wrote:
I've added every jar in the lib dir to my classpath and still no luck:
CLASSPATH=/tmp/spark/spark-branch-1.1/lib/datanucleus-api-jdo-3.2.1.jar:/tmp/spark/spark-branch-1.1/lib/datanucleus-core-3.2.2.jar:/tmp/spark/spark
Hi,
I've built spark 1.3 with hadoop 2.6 but when I startup the spark-shell I
get the following exception:
14/12/09 06:54:24 INFO server.AbstractConnector: Started
SelectChannelConnector@0.0.0.0:4040
14/12/09 06:54:24 INFO util.Utils: Successfully started service 'SparkUI'
on port 4040.
14/12/09
Hi,
I'm trying to start the thrift server but failing:
Exception in thread main java.lang.NoClassDefFoundError:
org/apache/tez/dag/api/SessionNotRunning
at
org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:353)
at
are rows inside of rows. If you
want to return a struct from a UDF you can do that with a case class.
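My reading of the suggestion, as a sketch (names are made up and the registration API may differ by version):
case class NameParts(first: String, last: String)
// the case class return type shows up as a struct column
sqlContext.udf.register("split_name", (full: String) => {
  val parts = full.split(" ", 2)
  NameParts(parts(0), if (parts.length > 1) parts(1) else "")
})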
On Tue, Nov 25, 2014 at 10:25 AM, Daniel Haviv danielru...@gmail.com
wrote:
Thank you.
How can I address more complex columns like maps and structs?
Thanks again!
Daniel
On 25 Nov 2014
Prepend file:// to the path
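I.e., something like (sketch):
rdd.saveAsObjectFile("file:///data/temp.txt")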
Daniel
On 26 Nov 2014, at 20:15, firemonk9 dhiraj.peech...@gmail.com wrote:
When I am running spark locally, RDD saveAsObjectFile writes the file to
local file system (ex : path /data/temp.txt)
and
when I am running spark on YARN cluster, RDD
The problem was that I didn't use the correct class name; it should
be org.apache.spark.serializer.KryoSerializer
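I.e. (sketch):
val conf = new org.apache.spark.SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")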
On Mon, Nov 24, 2014 at 11:12 PM, Daniel Haviv danielru...@gmail.com
wrote:
Hi,
I want to test Kryo serialization but when starting spark-shell I'm
hitting the following error
Hi,
I'm selecting columns from a JSON file, transforming some of them, and would
like to store the result as a Parquet file, but I'm failing.
This is what I'm doing:
val jsonFiles = sqlContext.jsonFile("/requests.loading")
jsonFiles.registerTempTable("jRequests")
val clean_jRequests = sqlContext.sql("select
pipeline work, we have been considering
adding something like:
def modifyColumn(columnName, function)
Any comments anyone has on this interface would be appreciated!
Michael
On Tue, Nov 25, 2014 at 7:02 AM, Daniel Haviv danielru...@gmail.com wrote:
Hi,
I'm selecting columns from
Hi,
I want to test Kryo serialization but when starting spark-shell I'm hitting
the following error:
java.lang.ClassNotFoundException: org.apache.spark.KryoSerializer
The kryo-2.21.jar is on the classpath, so I'm not sure why it's not picking
it up.
Thanks for your help,
Daniel
Hi,
I have a column in my schemaRDD that is a map, but I'm unable to access it as
one. I've tried casting it to a Tuple2[String,String]:
val converted = jsonFiles.map(line => {
line(10).asInstanceOf[Tuple2[String,String]]})
but I get ClassCastException:
14/11/23 11:51:30 WARN
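A possible fix, assuming the column really is a map type (untested sketch):
// map columns come back as scala.collection.Map rather than Tuple2
val converted = jsonFiles.map(line => line(10).asInstanceOf[scala.collection.Map[String, String]])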
Hi,
I'm ingesting a lot of small JSON files and converting them to unified Parquet
files, but even the unified files are fairly small (~10MB).
I want to run a merge operation every hour on the existing files, but it
takes a lot of time for such a small amount of data: about 3 GB spread over
3000