Reading spark-env.sh from configured directory

2017-08-02 Thread Lior Chaga
Hi,

I have multiple spark deployments using mesos.
I use spark.executor.uri to fetch the spark distribution to executor node.

Every time I upgrade spark, I download the default distribution and just
add a custom spark-env.sh to the spark/conf folder.

Furthermore, any change I want to make in spark-env.sh forces me to
re-package the distribution.

I'm trying to find a way to provide the executors with the location of the
spark conf dir using spark.executor.extraJavaOptions
(-DSPARK_CONF_DIR=/path/on/worker/node), but it doesn't seem to work.

Any idea how I can achieve this?

Thanks
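
A direction that might be worth testing, sketched under the assumption that
spark.executorEnv.* variables are placed in the Mesos executor's environment
before the launch scripts source spark-env.sh (untested):

// SPARK_CONF_DIR is read by the launch scripts as an environment variable,
// so a -D system property on the executor JVM is never seen by them.
// The path is hypothetical; verify the propagation order on your Mesos version.
SparkConf conf = new SparkConf()
    .set("spark.executorEnv.SPARK_CONF_DIR", "/path/on/worker/node");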


spark on mesos cluster - metrics with graphite sink

2016-06-09 Thread Lior Chaga
Hi,
I'm launching a spark application on a mesos cluster.
The namespace of the metric includes the framework id for driver metrics,
and both framework id and executor id for executor metrics.
These ids are obviously assigned by mesos, and they are not permanent -
re-registering the application would result in different metric names.

Is there a way to change the metric namespace? I'd prefer replacing the
framework id and executor ids with more logical names (like a custom
service name for the framework id, and an ordinal executor id instead of
the one generated by mesos).

Thanks,
Lior.
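
For reference: later Spark releases (around 2.1) expose spark.metrics.namespace,
which replaces the default spark.app.id prefix in metric names; a minimal
sketch, assuming a version that supports it:

// "my-service" is a hypothetical stable name replacing the mesos framework
// id prefix. The executor-id component of executor metrics is still
// assigned by the cluster manager.
SparkConf conf = new SparkConf()
    .set("spark.metrics.namespace", "my-service");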


Re: spark 1.6 new memory management - some issues with tasks not using all executors

2016-03-03 Thread Lior Chaga
No reference. I opened a ticket about the missing documentation for it, and
Sean Owen answered that this is not meant for spark users. I explained that
it's an issue, but no news so far.

As for the memory management, I'm not experienced with it, but I suggest
you read: http://0x0fff.com/spark-memory-management/ and
http://0x0fff.com/spark-architecture/
It could be that the effective default storage memory in spark 1.6 is a bit
lower than in spark 1.5, and your application can't borrow from the
execution memory.
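
For reference, a sketch of the knobs behind the 1.6 unified memory manager,
with their 1.6 defaults (storage and execution share one region and borrow
from each other; only the storageFraction share of cached data is protected
from eviction):

SparkConf conf = new SparkConf()
    // Unified region = (heap - 300MB reserve) * spark.memory.fraction.
    .set("spark.memory.fraction", "0.75")
    // Share of the unified region where cached blocks cannot be evicted
    // by execution; execution may still borrow it while it sits unused.
    .set("spark.memory.storageFraction", "0.5");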



On Thu, Mar 3, 2016 at 2:35 AM, Koert Kuipers <ko...@tresata.com> wrote:

> with the locality issue resolved, i am still struggling with the new
> memory management.
>
> i am seeing tasks on tiny amounts of data take 15 seconds, of which 14 are
> spent in GC. with the legacy memory management (spark.memory.useLegacyMode
> = true) they complete in 1 - 2 seconds.
>
> since we are permanently caching a very large number of RDDs, my suspicion
> is that with the new memory management these cached RDDs happily gobble up
> all the memory, and need to be evicted to run my small job, leading to the
> slowness.
>
> i can revert to legacy memory management mode, so this is not an issue,
> but i am worried that at some point the legacy memory management will be
> deprecated and then i am stuck with this performance issue.
>
> On Mon, Feb 29, 2016 at 12:47 PM, Koert Kuipers <ko...@tresata.com> wrote:
>
>> setting spark.shuffle.reduceLocality.enabled=false worked for me, thanks
>>
>>
>> is there any reference to the benefits of setting reduceLocality to true?
>> i am tempted to disable it across the board.
>>
>> On Mon, Feb 29, 2016 at 9:51 AM, Yin Yang <yy201...@gmail.com> wrote:
>>
>>> The default value for spark.shuffle.reduceLocality.enabled is true.
>>>
>>> To reduce surprise to users of 1.5 and earlier releases, should the
>>> default value be set to false ?
>>>
>>> On Mon, Feb 29, 2016 at 5:38 AM, Lior Chaga <lio...@taboola.com> wrote:
>>>
>>>> Hi Koert,
>>>> Try spark.shuffle.reduceLocality.enabled=false
>>>> This is an undocumented configuration.
>>>> See:
>>>> https://github.com/apache/spark/pull/8280
>>>> https://issues.apache.org/jira/browse/SPARK-10567
>>>>
>>>> It solved the problem for me (both with and without memory legacy mode)
>>>>
>>>>
>>>> On Sun, Feb 28, 2016 at 11:16 PM, Koert Kuipers <ko...@tresata.com>
>>>> wrote:
>>>>
>>>>> i find it particularly confusing that a new memory management module
>>>>> would change the locations. it's not like the hash partitioner got
>>>>> replaced. i can switch back and forth between legacy and "new" memory
>>>>> management and see the distribution change... fully reproducible
>>>>>
>>>>> On Sun, Feb 28, 2016 at 11:24 AM, Lior Chaga <lio...@taboola.com>
>>>>> wrote:
>>>>>
>>>>>> Hi,
>>>>>> I've experienced a similar problem upgrading from spark 1.4 to spark
>>>>>> 1.6.
>>>>>> The data is not evenly distributed across executors, but in my case
>>>>>> it also reproduced with legacy mode.
>>>>>> Also tried 1.6.1 rc-1, with same results.
>>>>>>
>>>>>> Still looking for resolution.
>>>>>>
>>>>>> Lior
>>>>>>
>>>>>> On Fri, Feb 19, 2016 at 2:01 AM, Koert Kuipers <ko...@tresata.com>
>>>>>> wrote:
>>>>>>
>>>>>>> looking at the cached rdd i see a similar story:
>>>>>>> with useLegacyMode = true the cached rdd is spread out across 10
>>>>>>> executors, but with useLegacyMode = false the data for the cached rdd
>>>>>>> sits on only 3 executors (the rest all show 0s). my cached RDD is a
>>>>>>> key-value RDD that got partitioned (hash partitioner, 50 partitions)
>>>>>>> before being cached.
>>>>>>>
>>>>>>> On Thu, Feb 18, 2016 at 6:51 PM, Koert Kuipers <ko...@tresata.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> hello all,
>>>>>>>> we are just testing a semi-realtime application (it should return
>>>>>>>> results in less than 20 seconds from cached RDDs) on spark 1.6.0.
>>>>>>>> before this it used to run on spark 1.5.1
>>>>>>>>
>>>>>>>> in spark 1.6.0 the performance is similar to 1.5.1 if i set
>>>>>>>> spark.memory.useLegacyMode = true, however if i switch to
>>>>>>>> spark.memory.useLegacyMode = false the queries take about 50% to
>>>>>>>> 100% more time.
>>>>>>>>
>>>>>>>> the issue becomes clear when i focus on a single stage: the
>>>>>>>> individual tasks are not slower at all, but they run on fewer
>>>>>>>> executors. in my test query i have 50 tasks and 10 executors. both
>>>>>>>> with useLegacyMode = true and useLegacyMode = false the tasks finish
>>>>>>>> in about 3 seconds and show as running PROCESS_LOCAL. however when
>>>>>>>> useLegacyMode = false the tasks run on just 3 executors out of 10,
>>>>>>>> while with useLegacyMode = true they spread out across 10 executors.
>>>>>>>> all the tasks running on just a few executors leads to the slower
>>>>>>>> results.
>>>>>>>>
>>>>>>>> any idea why this would happen?
>>>>>>>> thanks! koert
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>


Re: spark 1.6 new memory management - some issues with tasks not using all executors

2016-02-29 Thread Lior Chaga
Hi Koert,
Try spark.shuffle.reduceLocality.enabled=false
This is an undocumented configuration.
See:
https://github.com/apache/spark/pull/8280
https://issues.apache.org/jira/browse/SPARK-10567

It solved the problem for me (both with and without memory legacy mode)
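
A minimal sketch of applying it; the same flag can be passed to spark-submit
as --conf spark.shuffle.reduceLocality.enabled=false:

// Disables reducer locality preferences so reduce tasks are not packed
// onto the few executors holding most of the map output.
SparkConf conf = new SparkConf()
    .set("spark.shuffle.reduceLocality.enabled", "false");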


On Sun, Feb 28, 2016 at 11:16 PM, Koert Kuipers <ko...@tresata.com> wrote:

> i find it particularly confusing that a new memory management module would
> change the locations. it's not like the hash partitioner got replaced. i can
> switch back and forth between legacy and "new" memory management and see
> the distribution change... fully reproducible
>
> On Sun, Feb 28, 2016 at 11:24 AM, Lior Chaga <lio...@taboola.com> wrote:
>
>> Hi,
>> I've experienced a similar problem upgrading from spark 1.4 to spark 1.6.
>> The data is not evenly distributed across executors, but in my case it
>> also reproduced with legacy mode.
>> Also tried 1.6.1 rc-1, with same results.
>>
>> Still looking for resolution.
>>
>> Lior
>>
>> On Fri, Feb 19, 2016 at 2:01 AM, Koert Kuipers <ko...@tresata.com> wrote:
>>
>>> looking at the cached rdd i see a similar story:
>>> with useLegacyMode = true the cached rdd is spread out across 10
>>> executors, but with useLegacyMode = false the data for the cached rdd sits
>>> on only 3 executors (the rest all show 0s). my cached RDD is a key-value
>>> RDD that got partitioned (hash partitioner, 50 partitions) before being
>>> cached.
>>>
>>> On Thu, Feb 18, 2016 at 6:51 PM, Koert Kuipers <ko...@tresata.com>
>>> wrote:
>>>
>>>> hello all,
>>>> we are just testing a semi-realtime application (it should return
>>>> results in less than 20 seconds from cached RDDs) on spark 1.6.0. before
>>>> this it used to run on spark 1.5.1
>>>>
>>>> in spark 1.6.0 the performance is similar to 1.5.1 if i set
>>>> spark.memory.useLegacyMode = true, however if i switch to
>>>> spark.memory.useLegacyMode = false the queries take about 50% to 100% more
>>>> time.
>>>>
>>>> the issue becomes clear when i focus on a single stage: the individual
>>>> tasks are not slower at all, but they run on fewer executors.
>>>> in my test query i have 50 tasks and 10 executors. both with
>>>> useLegacyMode = true and useLegacyMode = false the tasks finish in about 3
>>>> seconds and show as running PROCESS_LOCAL. however when  useLegacyMode =
>>>> false the tasks run on just 3 executors out of 10, while with useLegacyMode
>>>> = true they spread out across 10 executors. all the tasks running on just a
>>>> few executors leads to the slower results.
>>>>
>>>> any idea why this would happen?
>>>> thanks! koert
>>>>
>>>>
>>>>
>>>
>>
>


Re: spark 1.6 new memory management - some issues with tasks not using all executors

2016-02-28 Thread Lior Chaga
Hi,
I've experienced a similar problem upgrading from spark 1.4 to spark 1.6.
The data is not evenly distributed across executors, but in my case it also
reproduced with legacy mode.
Also tried 1.6.1 rc-1, with same results.

Still looking for resolution.

Lior

On Fri, Feb 19, 2016 at 2:01 AM, Koert Kuipers  wrote:

> looking at the cached rdd i see a similar story:
> with useLegacyMode = true the cached rdd is spread out across 10
> executors, but with useLegacyMode = false the data for the cached rdd sits
> on only 3 executors (the rest all show 0s). my cached RDD is a key-value
> RDD that got partitioned (hash partitioner, 50 partitions) before being
> cached.
>
> On Thu, Feb 18, 2016 at 6:51 PM, Koert Kuipers  wrote:
>
>> hello all,
>> we are just testing a semi-realtime application (it should return results
>> in less than 20 seconds from cached RDDs) on spark 1.6.0. before this it
>> used to run on spark 1.5.1
>>
>> in spark 1.6.0 the performance is similar to 1.5.1 if i set
>> spark.memory.useLegacyMode = true, however if i switch to
>> spark.memory.useLegacyMode = false the queries take about 50% to 100% more
>> time.
>>
>> the issue becomes clear when i focus on a single stage: the individual
>> tasks are not slower at all, but they run on fewer executors.
>> in my test query i have 50 tasks and 10 executors. both with
>> useLegacyMode = true and useLegacyMode = false the tasks finish in about 3
>> seconds and show as running PROCESS_LOCAL. however when  useLegacyMode =
>> false the tasks run on just 3 executors out of 10, while with useLegacyMode
>> = true they spread out across 10 executors. all the tasks running on just a
>> few executors leads to the slower results.
>>
>> any idea why this would happen?
>> thanks! koert
>>
>>
>>
>


deleting application files in standalone cluster

2015-08-09 Thread Lior Chaga
Hi,
Using spark 1.4.0 in standalone mode, with following configuration:

SPARK_WORKER_OPTS=-Dspark.worker.cleanup.enabled=true
-Dspark.worker.cleanup.appDataTtl=86400

cleanup interval is set to default.

Application files are not deleted.

I'm using JavaSparkContext, and when the application ends it stops the context.

Maybe I should also call context.close()?

From what I understand, stop should be enough (
http://apache-spark-user-list.1001560.n3.nabble.com/SparkContext-stop-td17826.html#a17847
)

Thanks,

Lior
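
For what it's worth, JavaSparkContext implements java.io.Closeable, and
close() simply delegates to stop(); a minimal sketch, assuming that holds
for 1.4.0, which guarantees the context is stopped even if the job throws:

try (JavaSparkContext sc = new JavaSparkContext(conf)) {
    // run the application; close() == stop() runs on every exit path
}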


Use rank with distribute by in HiveContext

2015-07-16 Thread Lior Chaga
Does spark HiveContext support the rank() ... distribute by syntax (as in
the following article:
http://www.edwardcapriolo.com/roller/edwardcapriolo/entry/doing_rank_with_hive
)?

If not, how can it be achieved?

Thanks,
Lior
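
Spark 1.4's HiveContext supports window functions, which cover the same
per-group ranking as the rank-with-distribute-by trick; a hedged sketch
(table and column names are hypothetical):

DataFrame ranked = hiveContext.sql(
    "SELECT category, item, price, " +
    "       rank() OVER (PARTITION BY category ORDER BY price DESC) AS rnk " +
    "FROM products");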


Re: spark sql - group by constant column

2015-07-15 Thread Lior Chaga
I found the problem. Grouping by a constant column value is indeed
impossible. The reason it was working in my project is that I gave the
constant column an alias that already exists in the schema of the
dataframe: the dataframe contained a data_timestamp column representing an
hour, and I added to the select a constant data_timestamp that represented
the timestamp of the day. And that was the cause of my original bug - I
thought I was grouping by the day timestamp, when I was actually grouping
by each hour, and therefore I got multiple rows for each group-by
combination.
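
Since the constant is the same for every row, one fix is to keep the literal
in the select list and leave it out of the GROUP BY entirely; a sketch
against the schema of the query quoted below:

DataFrame grouped = sqlContext.sql(
    "SELECT primary_one, primary_two, 10 AS num, SUM(measure) AS total_measures " +
    "FROM tbl " +
    "GROUP BY primary_one, primary_two");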

On Wed, Jul 15, 2015 at 10:09 AM, Lior Chaga lio...@taboola.com wrote:

 Hi,

 I'm facing a bug with group by in SparkSQL (version 1.4).
 I registered a JavaRDD of objects containing integer fields as a table.

 Then I'm trying to do a group by, with a constant value in the group by
 fields:

 SELECT primary_one, primary_two, 10 as num, SUM(measure) as total_measures
 FROM tbl
 GROUP BY primary_one, primary_two, num


 I get the following exception:
 org.apache.spark.sql.AnalysisException: cannot resolve 'num' given input
 columns measure, primary_one, primary_two

 Tried both with HiveContext and SqlContext.
 The odd thing is that this kind of query actually works for me in a
 project I'm working on, but there I have another bug (the group by does not
 yield the expected results).

 The only reason I can think of is that maybe in my real project, the
 context configuration is different.
 In my above example the configuration of the HiveContext is empty.

 In my real project, the configuration is shown below.
 Any ideas?

 Thanks,
 Lior

 Hive context configuration in project:
 (mapreduce.jobtracker.jobhistory.task.numberprogresssplits,12)
 (nfs3.mountd.port,4242)
 (mapreduce.tasktracker.healthchecker.script.timeout,60)
 (yarn.app.mapreduce.am.scheduler.heartbeat.interval-ms,1000)
 (mapreduce.input.fileinputformat.input.dir.recursive,false)
 (hive.orc.compute.splits.num.threads,10)

 (mapreduce.job.classloader.system.classes,java.,javax.,org.apache.commons.logging.,org.apache.log4j.,org.apache.hadoop.)
 (hive.auto.convert.sortmerge.join.to.mapjoin,false)
 (hadoop.http.authentication.kerberos.principal,HTTP/_HOST@LOCALHOST)
 (hive.exec.perf.logger,org.apache.hadoop.hive.ql.log.PerfLogger)
  (hive.mapjoin.lazy.hashtable,true)
  (mapreduce.framework.name,local)
  (hive.exec.script.maxerrsize,10)
  (dfs.namenode.checkpoint.txns,100)
  (tfile.fs.output.buffer.size,262144)
  (yarn.app.mapreduce.am.job.task.listener.thread-count,30)
  (mapreduce.tasktracker.local.dir.minspacekill,0)
  (hive.support.concurrency,false)
  (fs.s3.block.size,67108864)

  (hive.script.recordwriter,org.apache.hadoop.hive.ql.exec.TextRecordWriter)
  (hive.stats.retries.max,0)
  (hadoop.hdfs.configuration.version,1)
  (dfs.bytes-per-checksum,512)
  (fs.s3.buffer.dir,${hadoop.tmp.dir}/s3)
  (mapreduce.job.acl-view-job, )
  (hive.typecheck.on.insert,true)
  (mapreduce.jobhistory.loadedjobs.cache.size,5)
  (mapreduce.jobtracker.persist.jobstatus.hours,1)
  (hive.unlock.numretries,10)
  (dfs.namenode.handler.count,10)
  (mapreduce.input.fileinputformat.split.minsize,1)
  (hive.plan.serialization.format,kryo)
  (dfs.datanode.failed.volumes.tolerated,0)
  (yarn.resourcemanager.container.liveness-monitor.interval-ms,60)
  (yarn.resourcemanager.amliveliness-monitor.interval-ms,1000)
  (yarn.resourcemanager.client.thread-count,50)
  (io.seqfile.compress.blocksize,100)
  (mapreduce.tasktracker.http.threads,40)
  (hive.explain.dependency.append.tasktype,false)
  (dfs.namenode.retrycache.expirytime.millis,60)
  (dfs.namenode.backup.address,0.0.0.0:50100)
  (hive.hwi.listen.host,0.0.0.0)
  (dfs.datanode.data.dir,file://${hadoop.tmp.dir}/dfs/data)
  (dfs.replication,3)
  (mapreduce.jobtracker.jobhistory.block.size,3145728)

  
 (dfs.secondary.namenode.kerberos.internal.spnego.principal,${dfs.web.authentication.kerberos.principal})
  (mapreduce.task.profile.maps,0-2)
  (fs.har.impl,org.apache.hadoop.hive.shims.HiveHarFileSystem)
  (hive.stats.reliable,false)
  (yarn.nodemanager.admin-env,MALLOC_ARENA_MAX=$MALLOC_ARENA_MAX)




spark sql - group by constant column

2015-07-15 Thread Lior Chaga
Hi,

I'm facing a bug with group by in SparkSQL (version 1.4).
I registered a JavaRDD of objects containing integer fields as a table.

Then I'm trying to do a group by, with a constant value in the group by
fields:

SELECT primary_one, primary_two, 10 as num, SUM(measure) as total_measures
FROM tbl
GROUP BY primary_one, primary_two, num


I get the following exception:
org.apache.spark.sql.AnalysisException: cannot resolve 'num' given input
columns measure, primary_one, primary_two

Tried both with HiveContext and SqlContext.
The odd thing is that this kind of query actually works for me in a project
I'm working on, but there I have another bug (the group by does not yield
the expected results).

The only reason I can think of is that maybe in my real project, the
context configuration is different.
In my above example the configuration of the HiveContext is empty.

In my real project, the configuration is shown below.
Any ideas?

Thanks,
Lior

Hive context configuration in project:
(mapreduce.jobtracker.jobhistory.task.numberprogresssplits,12)
(nfs3.mountd.port,4242)
(mapreduce.tasktracker.healthchecker.script.timeout,60)
(yarn.app.mapreduce.am.scheduler.heartbeat.interval-ms,1000)
(mapreduce.input.fileinputformat.input.dir.recursive,false)
(hive.orc.compute.splits.num.threads,10)
(mapreduce.job.classloader.system.classes,java.,javax.,org.apache.commons.logging.,org.apache.log4j.,org.apache.hadoop.)
(hive.auto.convert.sortmerge.join.to.mapjoin,false)
(hadoop.http.authentication.kerberos.principal,HTTP/_HOST@LOCALHOST)
(hive.exec.perf.logger,org.apache.hadoop.hive.ql.log.PerfLogger)
 (hive.mapjoin.lazy.hashtable,true)
 (mapreduce.framework.name,local)
 (hive.exec.script.maxerrsize,10)
 (dfs.namenode.checkpoint.txns,100)
 (tfile.fs.output.buffer.size,262144)
 (yarn.app.mapreduce.am.job.task.listener.thread-count,30)
 (mapreduce.tasktracker.local.dir.minspacekill,0)
 (hive.support.concurrency,false)
 (fs.s3.block.size,67108864)
 (hive.script.recordwriter,org.apache.hadoop.hive.ql.exec.TextRecordWriter)
 (hive.stats.retries.max,0)
 (hadoop.hdfs.configuration.version,1)
 (dfs.bytes-per-checksum,512)
 (fs.s3.buffer.dir,${hadoop.tmp.dir}/s3)
 (mapreduce.job.acl-view-job, )
 (hive.typecheck.on.insert,true)
 (mapreduce.jobhistory.loadedjobs.cache.size,5)
 (mapreduce.jobtracker.persist.jobstatus.hours,1)
 (hive.unlock.numretries,10)
 (dfs.namenode.handler.count,10)
 (mapreduce.input.fileinputformat.split.minsize,1)
 (hive.plan.serialization.format,kryo)
 (dfs.datanode.failed.volumes.tolerated,0)
 (yarn.resourcemanager.container.liveness-monitor.interval-ms,60)
 (yarn.resourcemanager.amliveliness-monitor.interval-ms,1000)
 (yarn.resourcemanager.client.thread-count,50)
 (io.seqfile.compress.blocksize,100)
 (mapreduce.tasktracker.http.threads,40)
 (hive.explain.dependency.append.tasktype,false)
 (dfs.namenode.retrycache.expirytime.millis,60)
 (dfs.namenode.backup.address,0.0.0.0:50100)
 (hive.hwi.listen.host,0.0.0.0)
 (dfs.datanode.data.dir,file://${hadoop.tmp.dir}/dfs/data)
 (dfs.replication,3)
 (mapreduce.jobtracker.jobhistory.block.size,3145728)
 
(dfs.secondary.namenode.kerberos.internal.spnego.principal,${dfs.web.authentication.kerberos.principal})
 (mapreduce.task.profile.maps,0-2)
 (fs.har.impl,org.apache.hadoop.hive.shims.HiveHarFileSystem)
 (hive.stats.reliable,false)
 (yarn.nodemanager.admin-env,MALLOC_ARENA_MAX=$MALLOC_ARENA_MAX)


Re: Help optimising Spark SQL query

2015-06-22 Thread Lior Chaga
Hi James,

There are a few configurations that you can try:
https://spark.apache.org/docs/latest/sql-programming-guide.html#other-configuration-options

From my experience, the codegen really boosts things up. Just run
sqlContext.sql("SET spark.sql.codegen=true") before you execute your query.
But keep in mind that codegen is sometimes buggy (depending on your query),
so compare the results with and without it to be sure.
You can also try changing the default number of partitions.

You can also use dataframes (available since 1.3). I'm not sure they are
better than specifying the query in 1.3, but in spark 1.4 there should be an
enormous performance improvement in dataframes.
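
A sketch of both suggestions together; spark.sql.codegen applies to the
1.3/1.4 era (it was later folded into Tungsten), and
spark.sql.shuffle.partitions defaults to 200:

// Session-level settings; tune the partition count to the cluster size.
sqlContext.sql("SET spark.sql.codegen=true");
sqlContext.sql("SET spark.sql.shuffle.partitions=40");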

Lior

On Mon, Jun 22, 2015 at 6:28 PM, James Aley james.a...@swiftkey.com wrote:

 Hello,

 A colleague of mine ran the following Spark SQL query:

 select
   count(*) as uses,
   count (distinct cast(id as string)) as users
 from usage_events
 where
   from_unixtime(cast(timestamp_millis/1000 as bigint))
 between '2015-06-09' and '2015-06-16'

 The table contains billions of rows, but totals only 64GB of data across
 ~30 separate files, which are stored as Parquet with LZO compression in S3.

 From the referenced columns:

 * id is Binary, which we cast to a String so that we can DISTINCT by it.
 (I was already told this will improve in a later release, in a separate
 thread.)
 * timestamp_millis is a long, containing a unix timestamp with
 millisecond resolution

 This took nearly 2 hours to run on a 5 node cluster of r3.xlarge EC2
 instances, using 20 executors, each with 4GB memory. I can see from
 monitoring tools that the CPU usage is at 100% on all nodes, but incoming
 network seems a bit low at 2.5MB/s, suggesting to me that this is CPU-bound.

 Does that seem slow? Can anyone offer any ideas by glancing at the query
 as to why this might be slow? We'll profile it meanwhile and post back if
 we find anything ourselves.

 A side issue - I've found that this query, and others, sometimes completes
 but doesn't return any results. There appears to be no error that I can see
 in the logs, and Spark reports the job as successful, but the connected
 JDBC client (SQLWorkbenchJ in this case), just sits there forever waiting.
 I did a quick Google and couldn't find anyone else having similar issues.


 Many thanks,

 James.



Re: spark sql hive-shims

2015-05-14 Thread Lior Chaga
I see that the pre-built distributions include hive-shims-0.23 shaded into
the spark-assembly jar (unlike when I make the distribution myself).
Does anyone know what I should do to include the shims in my distribution?


On Thu, May 14, 2015 at 9:52 AM, Lior Chaga lio...@taboola.com wrote:

 Ultimately it was PermGen out of memory. I somehow missed it in the log

 On Thu, May 14, 2015 at 9:24 AM, Lior Chaga lio...@taboola.com wrote:

 After profiling with YourKit, I see there's an OutOfMemoryException in
 context SQLContext.applySchema. Again, it's a very small RDD. Each executor
 has 180GB RAM.

 On Thu, May 14, 2015 at 8:53 AM, Lior Chaga lio...@taboola.com wrote:

 Hi,

 Using spark sql with HiveContext. Spark version is 1.3.1
 When running local spark everything works fine. When running on spark
 cluster I get ClassNotFoundError org.apache.hadoop.hive.shims.Hadoop23Shims.
 This class belongs to hive-shims-0.23, and is a runtime dependency for
 spark-hive:

 [INFO] org.apache.spark:spark-hive_2.10:jar:1.3.1
 [INFO] +- org.spark-project.hive:hive-metastore:jar:0.13.1a:compile
 [INFO] |  +- org.spark-project.hive:hive-shims:jar:0.13.1a:compile
 [INFO] |  |  +-
 org.spark-project.hive.shims:hive-shims-common:jar:0.13.1a:compile
 [INFO] |  |  +-
 org.spark-project.hive.shims:hive-shims-0.20:jar:0.13.1a:runtime
 [INFO] |  |  +-
 org.spark-project.hive.shims:hive-shims-common-secure:jar:0.13.1a:compile
 [INFO] |  |  +-
 org.spark-project.hive.shims:hive-shims-0.20S:jar:0.13.1a:runtime
 [INFO] |  |  \-
 org.spark-project.hive.shims:hive-shims-0.23:jar:0.13.1a:runtime



 My spark distribution is:
 make-distribution.sh --tgz  -Phive -Phive-thriftserver -DskipTests


 If I try to add this dependency to my driver project, then the exception
 disappears, but then the task is stuck when registering an rdd as a table
 (I get timeout after 30 seconds). I should emphasize that the first rdd I
 register as a table is a very small one (about 60K row), and as I said - it
 runs swiftly in local.
 I suspect maybe other dependencies are missing, but they fail silently.

 Would be grateful if anyone knows how to solve it.

 Lior






Re: spark sql hive-shims

2015-05-14 Thread Lior Chaga
After profiling with YourKit, I see there's an OutOfMemoryException in
SQLContext.applySchema. Again, it's a very small RDD. Each executor has
180GB RAM.

On Thu, May 14, 2015 at 8:53 AM, Lior Chaga lio...@taboola.com wrote:

 Hi,

 Using spark sql with HiveContext. Spark version is 1.3.1
 When running local spark everything works fine. When running on spark
 cluster I get ClassNotFoundError org.apache.hadoop.hive.shims.Hadoop23Shims.
 This class belongs to hive-shims-0.23, and is a runtime dependency for
 spark-hive:

 [INFO] org.apache.spark:spark-hive_2.10:jar:1.3.1
 [INFO] +- org.spark-project.hive:hive-metastore:jar:0.13.1a:compile
 [INFO] |  +- org.spark-project.hive:hive-shims:jar:0.13.1a:compile
 [INFO] |  |  +-
 org.spark-project.hive.shims:hive-shims-common:jar:0.13.1a:compile
 [INFO] |  |  +-
 org.spark-project.hive.shims:hive-shims-0.20:jar:0.13.1a:runtime
 [INFO] |  |  +-
 org.spark-project.hive.shims:hive-shims-common-secure:jar:0.13.1a:compile
 [INFO] |  |  +-
 org.spark-project.hive.shims:hive-shims-0.20S:jar:0.13.1a:runtime
 [INFO] |  |  \-
 org.spark-project.hive.shims:hive-shims-0.23:jar:0.13.1a:runtime



 My spark distribution is:
 make-distribution.sh --tgz  -Phive -Phive-thriftserver -DskipTests


 If I try to add this dependency to my driver project, then the exception
 disappears, but then the task is stuck when registering an rdd as a table
 (I get timeout after 30 seconds). I should emphasize that the first rdd I
 register as a table is a very small one (about 60K row), and as I said - it
 runs swiftly in local.
 I suspect maybe other dependencies are missing, but they fail silently.

 Would be grateful if anyone knows how to solve it.

 Lior




Re: spark sql hive-shims

2015-05-14 Thread Lior Chaga
Ultimately it was a PermGen out-of-memory error. I somehow missed it in the log.
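
For reference, a sketch of raising the PermGen ceiling on a pre-Java-8 JVM
(512m is an arbitrary example; the equivalent driver-side flag must be set
before the driver JVM starts, e.g. via spark-submit --driver-java-options):

// PermGen holds class metadata; heavy class loading (hive, codegen) can
// exhaust it regardless of how large the heap is.
SparkConf conf = new SparkConf()
    .set("spark.executor.extraJavaOptions", "-XX:MaxPermSize=512m");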

On Thu, May 14, 2015 at 9:24 AM, Lior Chaga lio...@taboola.com wrote:

 After profiling with YourKit, I see there's an OutOfMemoryException in
 context SQLContext.applySchema. Again, it's a very small RDD. Each executor
 has 180GB RAM.

 On Thu, May 14, 2015 at 8:53 AM, Lior Chaga lio...@taboola.com wrote:

 Hi,

 Using spark sql with HiveContext. Spark version is 1.3.1
 When running local spark everything works fine. When running on spark
 cluster I get ClassNotFoundError org.apache.hadoop.hive.shims.Hadoop23Shims.
 This class belongs to hive-shims-0.23, and is a runtime dependency for
 spark-hive:

 [INFO] org.apache.spark:spark-hive_2.10:jar:1.3.1
 [INFO] +- org.spark-project.hive:hive-metastore:jar:0.13.1a:compile
 [INFO] |  +- org.spark-project.hive:hive-shims:jar:0.13.1a:compile
 [INFO] |  |  +-
 org.spark-project.hive.shims:hive-shims-common:jar:0.13.1a:compile
 [INFO] |  |  +-
 org.spark-project.hive.shims:hive-shims-0.20:jar:0.13.1a:runtime
 [INFO] |  |  +-
 org.spark-project.hive.shims:hive-shims-common-secure:jar:0.13.1a:compile
 [INFO] |  |  +-
 org.spark-project.hive.shims:hive-shims-0.20S:jar:0.13.1a:runtime
 [INFO] |  |  \-
 org.spark-project.hive.shims:hive-shims-0.23:jar:0.13.1a:runtime



 My spark distribution is:
 make-distribution.sh --tgz  -Phive -Phive-thriftserver -DskipTests


 If I try to add this dependency to my driver project, then the exception
 disappears, but then the task is stuck when registering an rdd as a table
 (I get timeout after 30 seconds). I should emphasize that the first rdd I
 register as a table is a very small one (about 60K row), and as I said - it
 runs swiftly in local.
 I suspect maybe other dependencies are missing, but they fail silently.

 Would be grateful if anyone knows how to solve it.

 Lior





spark sql hive-shims

2015-05-13 Thread Lior Chaga
Hi,

Using spark sql with HiveContext. Spark version is 1.3.1
When running spark locally everything works fine. When running on a spark
cluster I get ClassNotFoundError org.apache.hadoop.hive.shims.Hadoop23Shims.
This class belongs to hive-shims-0.23, and is a runtime dependency for
spark-hive:

[INFO] org.apache.spark:spark-hive_2.10:jar:1.3.1
[INFO] +- org.spark-project.hive:hive-metastore:jar:0.13.1a:compile
[INFO] |  +- org.spark-project.hive:hive-shims:jar:0.13.1a:compile
[INFO] |  |  +-
org.spark-project.hive.shims:hive-shims-common:jar:0.13.1a:compile
[INFO] |  |  +-
org.spark-project.hive.shims:hive-shims-0.20:jar:0.13.1a:runtime
[INFO] |  |  +-
org.spark-project.hive.shims:hive-shims-common-secure:jar:0.13.1a:compile
[INFO] |  |  +-
org.spark-project.hive.shims:hive-shims-0.20S:jar:0.13.1a:runtime
[INFO] |  |  \-
org.spark-project.hive.shims:hive-shims-0.23:jar:0.13.1a:runtime



My spark distribution is:
make-distribution.sh --tgz  -Phive -Phive-thriftserver -DskipTests


If I try to add this dependency to my driver project, the exception
disappears, but then the task gets stuck when registering an rdd as a table
(I get a timeout after 30 seconds). I should emphasize that the first rdd I
register as a table is a very small one (about 60K rows), and as I said, it
runs swiftly in local mode.
I suspect other dependencies may be missing but fail silently.

Would be grateful if anyone knows how to solve it.

Lior


mapping JavaRDD to jdbc DataFrame

2015-05-04 Thread Lior Chaga
Hi,

I'd like to use a JavaRDD containing parameters for an SQL query, and use
SparkSQL jdbc to load data from MySQL.

Consider the following pseudo code:

JavaRDD<String> namesRdd = ... ;
...
options.put("url", "jdbc:mysql://mysql?user=usr");
options.put("password", "pass");
options.put("dbtable", "(SELECT * FROM mytable WHERE userName = ?) sp_campaigns");
DataFrame myTableDF = m_sqlContext.load("jdbc", options);


I'm looking for a way to map namesRdd and get, for each name, the result of
the query, without losing the spark context.

Using a mapping function doesn't seem like an option, because I don't have
an SQLContext inside it.
I can only think of using collect and then iterating over the strings in
the RDD to execute the query, but that would run in the driver program.

Any suggestions?

Thanks,
Lior
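
One distributed alternative, sketched under the assumption that loading the
table (or a filtered view of it) once is acceptable: turn the names into a
DataFrame and join, so the lookup never leaves the cluster. NameRow is a
hypothetical bean with a single name field:

// Java 8 lambda; on Java 7 use an anonymous Function instead.
DataFrame namesDF = sqlContext.createDataFrame(
    namesRdd.map(n -> new NameRow(n)), NameRow.class);
DataFrame result = myTableDF.join(
    namesDF, myTableDF.col("userName").equalTo(namesDF.col("name")));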


Date class not supported by SparkSQL

2015-04-19 Thread Lior Chaga
Using Spark 1.2.0. I tried to apply a schema and register an RDD, and got:
scala.MatchError: class java.util.Date (of class java.lang.Class)

I see it was resolved in https://issues.apache.org/jira/browse/SPARK-2562
(included in 1.2.0)

Anyone encountered this issue?

Thanks,
Lior


Re: Date class not supported by SparkSQL

2015-04-19 Thread Lior Chaga
Here's a code example:

import java.io.Serializable;
import java.util.Date;
import java.util.List;

import com.google.common.collect.Lists;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.api.java.JavaSQLContext;

public class DateSparkSQLExample {

    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("test").setMaster("local");
        JavaSparkContext sc = new JavaSparkContext(conf);

        List<SomeObject> itemsList = Lists.newArrayListWithCapacity(1);
        itemsList.add(new SomeObject(new Date(), 1L));
        JavaRDD<SomeObject> someObjectJavaRDD = sc.parallelize(itemsList);

        JavaSQLContext sqlContext = new JavaSQLContext(sc);
        // Fails with scala.MatchError: schema inference does not know how
        // to map the java.util.Date field.
        sqlContext.applySchema(someObjectJavaRDD, SomeObject.class)
            .registerTempTable("temp_table");
    }

    private static class SomeObject implements Serializable {
        private Date timestamp;
        private Long value;

        public SomeObject() {
        }

        public SomeObject(Date timestamp, Long value) {
            this.timestamp = timestamp;
            this.value = value;
        }

        public Date getTimestamp() {
            return timestamp;
        }

        public void setTimestamp(Date timestamp) {
            this.timestamp = timestamp;
        }

        public Long getValue() {
            return value;
        }

        public void setValue(Long value) {
            this.value = value;
        }
    }
}
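
A hedged fix sketch: Spark SQL's schema inference supports java.sql.Date and
java.sql.Timestamp but, to my knowledge, not java.util.Date, so changing the
bean's field type avoids the MatchError:

// In SomeObject, store an SQL timestamp instead of java.util.Date:
private java.sql.Timestamp timestamp;
...
itemsList.add(new SomeObject(new java.sql.Timestamp(System.currentTimeMillis()), 1L));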


On Sun, Apr 19, 2015 at 4:27 PM, Lior Chaga lio...@taboola.com wrote:

 Using Spark 1.2.0. I tried to apply a schema and register an RDD, and got:
 scala.MatchError: class java.util.Date (of class java.lang.Class)

 I see it was resolved in https://issues.apache.org/jira/browse/SPARK-2562
 (included in 1.2.0)

 Anyone encountered this issue?

 Thanks,
 Lior



using log4j2 with spark

2015-03-04 Thread Lior Chaga
Hi,
I'm trying to run spark 1.2.1 w/ hadoop 1.0.4 on a cluster and configure it
to run with log4j2.
The problem is that spark-assembly.jar contains log4j and slf4j classes
compatible with log4j 1.2, and so spark detects that it should use log4j 1.2 (
https://github.com/apache/spark/blob/54e7b456dd56c9e52132154e699abca87563465b/core/src/main/scala/org/apache/spark/Logging.scala
on line 121).

Is there a maven profile for building spark-assembly w/out the log4j
dependencies, or any other way I can force spark to use log4j2?

Thanks!
Lior