Re: question on SPARK_WORKER_CORES

2017-02-17 Thread Alex Kozlov
I found in some previous CDH versions that Spark starts only one executor
per Spark slave, and DECREASING the executor-cores in standalone makes the
total # of executors go up.  Just my 2¢.
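
For reference, a minimal sketch (the master URL and values here are
illustrative, not taken from this thread) of how those flags map onto a
SparkConf in standalone mode: --total-executor-cores corresponds to
spark.cores.max, --executor-cores to spark.executor.cores, and
SPARK_WORKER_CORES only caps what each worker daemon offers to applications.

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("core-sizing-sketch")
  .setMaster("spark://master:7077")   // hypothetical standalone master
  .set("spark.cores.max", "16")       // same effect as --total-executor-cores 16
  .set("spark.executor.cores", "4")   // same effect as --executor-cores 4
val sc = new SparkContext(conf)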

On Fri, Feb 17, 2017 at 5:20 PM, kant kodali <kanth...@gmail.com> wrote:

> Hi Satish,
>
> I am using Spark 2.0.2.  And no, I have not passed those variables because
> I didn't want to shoot in the dark. According to the documentation it looks
> like SPARK_WORKER_CORES is the one which should do it. If not, can you
> please explain how these variables interact?
>
> --num-executors
> --executor-cores
> --total-executor-cores
> SPARK_WORKER_CORES
>
> Thanks!
>
>
> On Fri, Feb 17, 2017 at 5:13 PM, Satish Lalam <sati...@microsoft.com>
> wrote:
>
>> Have you tried passing --executor-cores or --total-executor-cores as
>> arguments, depending on the Spark version?
>>
>>
>>
>>
>>
>> *From:* kant kodali [mailto:kanth...@gmail.com]
>> *Sent:* Friday, February 17, 2017 5:03 PM
>> *To:* Alex Kozlov <ale...@gmail.com>
>> *Cc:* user @spark <user@spark.apache.org>
>> *Subject:* Re: question on SPARK_WORKER_CORES
>>
>>
>>
>> Standalone.
>>
>>
>>
>> On Fri, Feb 17, 2017 at 5:01 PM, Alex Kozlov <ale...@gmail.com> wrote:
>>
>> What Spark mode are you running the program in?
>>
>>
>>
>> On Fri, Feb 17, 2017 at 4:55 PM, kant kodali <kanth...@gmail.com> wrote:
>>
>> When I submit a job using the Spark shell, I get something like this:
>>
>>
>>
>> [Stage 0:>(36814 + 4) / 220129]
>>
>>
>>
>> Now all I want is to increase the number of parallel tasks running
>> from 4 to 16, so I exported an env variable called SPARK_WORKER_CORES=16 in
>> conf/spark-env.sh. I thought that should do it, but it doesn't. It still
>> shows me 4. Any idea?
>>
>>
>>
>> Thanks much!
>>
>>
>>
>>
>>
>> --
>>
>> Alex Kozlov
>> (408) 507-4987
>> (650) 887-2135 efax
>> ale...@gmail.com
>>
>>
>>
>
>


-- 
Alex Kozlov
(408) 507-4987
(650) 887-2135 efax
ale...@gmail.com


Re: question on SPARK_WORKER_CORES

2017-02-17 Thread Alex Kozlov
What Spark mode are you running the program in?

On Fri, Feb 17, 2017 at 4:55 PM, kant kodali <kanth...@gmail.com> wrote:

> When I submit a job using the Spark shell, I get something like this:
>
> [Stage 0:>(36814 + 4) / 220129]
>
>
> Now all I want is to increase the number of parallel tasks running from
> 4 to 16, so I exported an env variable called SPARK_WORKER_CORES=16 in
> conf/spark-env.sh. I thought that should do it, but it doesn't. It still
> shows me 4. Any idea?
>
>
> Thanks much!
>
>
>


-- 
Alex Kozlov
(408) 507-4987
(650) 887-2135 efax
ale...@gmail.com


Re: Processing millions of messages in milliseconds -- Architecture guide required

2016-04-19 Thread Alex Kozlov
This is too big of a topic.  For starters, what is the latency between when
you obtain the data and when the data is available for analysis?  Obviously
if this is < 5 minutes, you probably need a streaming solution.  How fast do
the "micro batches of seconds" need to be available for analysis?  Can the
problem be easily partitioned, and how flexible are you in the # of machines
for your solution?  Are you OK with availability fat tails?

Another question: how big is an individual message in bytes?  XML/JSON are
extremely inefficient, and with "10 mils of messages" you might hit other
bottlenecks like the network unless you convert them into something more
machine-friendly like Protobuf, Avro or Thrift.

From the top, look at Kafka, Flume, Storm.

To "serve the historical  data in milliseconds (may be upto 30 days of
data)" you'll need to cache data in memory.  The question, again, is how
often the data change.  You might look into Lambda architectures.
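
As a starting point, here is a hedged sketch of a Kafka-to-Spark-Streaming
ingest path with one-second micro batches (Spark 1.x streaming API; it needs
the spark-streaming-kafka artifact, and the broker and topic names are made
up):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils
import kafka.serializer.StringDecoder

val conf = new SparkConf().setAppName("micro-batch-ingest")
val ssc = new StreamingContext(conf, Seconds(1))

// Direct Kafka stream: one RDD of (key, message) pairs per one-second batch.
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc,
  Map("metadata.broker.list" -> "broker1:9092"),  // hypothetical broker list
  Set("events"))                                  // hypothetical topic

stream.map(_._2).foreachRDD { rdd =>
  // parse, enhance and persist each micro batch here (e.g. into a cache or columnar store)
  println(s"batch size: ${rdd.count()} messages")
}

ssc.start()
ssc.awaitTermination()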

On Mon, Apr 18, 2016 at 10:21 PM, Prashant Sharma <scrapco...@gmail.com>
wrote:

> Hello Deepak,
>
> It is not clear what you want to do. Are you talking about Spark Streaming?
> It is possible to process historical data in Spark batch mode too. You
> can add a timestamp field to the XML/JSON. The Spark documentation is at
> spark.apache.org. Spark has good built-in features to process JSON and
> XML [1] messages.
>
> Thanks,
> Prashant Sharma
>
> 1. https://github.com/databricks/spark-xml
>
> On Tue, Apr 19, 2016 at 10:31 AM, Deepak Sharma <deepakmc...@gmail.com>
> wrote:
>
>> Hi all,
>> I am looking for an architecture to ingest 10 mils of messages in
>> micro batches of seconds.
>> If anyone has worked on a similar kind of architecture, can you please
>> point me to any documentation around the same, like what the
>> architecture should be, which components/big data ecosystem tools I should
>> consider, etc.
>> The messages have to be in XML/JSON format, with a preprocessor engine or
>> message enhancer and then finally a processor.
>> I thought about using a data cache as well for serving the data.
>> The data cache should have the capability to serve the historical data
>> in milliseconds (may be up to 30 days of data)
>> --
>> Thanks
>> Deepak
>> www.bigdatabig.com
>>
>>
--
Alex Kozlov
ale...@gmail.com


Re: sparkR issues ?

2016-03-15 Thread Alex Kozlov
Hi Roni, you can probably rename the as.data.frame in
$SPARK_HOME/R/pkg/R/DataFrame.R and re-install SparkR by running
install-dev.sh

On Tue, Mar 15, 2016 at 8:46 AM, roni <roni.epi...@gmail.com> wrote:

> Hi ,
>  Is there a work around for this?
>  Do i need to file a bug for this?
> Thanks
> -R
>
> On Tue, Mar 15, 2016 at 12:28 AM, Sun, Rui <rui@intel.com> wrote:
>
>> It seems as.data.frame() defined in SparkR covers the version in the R base
>> package.
>>
>> We can try to see if we can change the implementation of as.data.frame()
>> in SparkR to avoid such covering.
>>
>>
>>
>> *From:* Alex Kozlov [mailto:ale...@gmail.com]
>> *Sent:* Tuesday, March 15, 2016 2:59 PM
>> *To:* roni <roni.epi...@gmail.com>
>> *Cc:* user@spark.apache.org
>> *Subject:* Re: sparkR issues ?
>>
>>
>>
>> This seems to be a very unfortunate name collision.  SparkR defines its
>> own DataFrame class, which shadows what seems to be your own definition.
>>
>>
>>
>> Is DataFrame something you define?  Can you rename it?
>>
>>
>>
>> On Mon, Mar 14, 2016 at 10:44 PM, roni <roni.epi...@gmail.com> wrote:
>>
>> Hi,
>>
>> I am working in bioinformatics and trying to convert some scripts to
>> sparkR to fit into other Spark jobs.
>>
>>
>>
>> I tried a simple example from a bioinf lib, and as soon as I start the
>> sparkR environment it does not work.
>>
>>
>>
>> code as follows -
>>
>> countData <- matrix(1:100,ncol=4)
>>
>> condition <- factor(c("A","A","B","B"))
>>
>> dds <- DESeqDataSetFromMatrix(countData, DataFrame(condition), ~
>> condition)
>>
>>
>>
>> It works if I don't initialize the sparkR environment.
>>
>> If I do library(SparkR) and sqlContext <- sparkRSQL.init(sc), it gives the
>> following error:
>>
>>
>>
>> > dds <- DESeqDataSetFromMatrix(countData, as.data.frame(condition), ~
>> condition)
>>
>> Error in DataFrame(colData, row.names = rownames(colData)) :
>>
>>   cannot coerce class "data.frame" to a DataFrame
>>
>>
>>
>> I am really stumped. I am not using any Spark function, so I would
>> expect it to work as simple R code.
>>
>> Why does it not work?
>>
>>
>>
>> Appreciate  the help
>>
>> -R
>>
>>
>>
>>
>>
>>
>>
>> --
>>
>> Alex Kozlov
>> (408) 507-4987
>> (650) 887-2135 efax
>> ale...@gmail.com
>>
>
>


-- 
Alex Kozlov
(408) 507-4987
(650) 887-2135 efax
ale...@gmail.com


Re: sparkR issues ?

2016-03-15 Thread Alex Kozlov
This seems to be a very unfortunate name collision.  SparkR defines its
own DataFrame class, which shadows what seems to be your own definition.

Is DataFrame something you define?  Can you rename it?

On Mon, Mar 14, 2016 at 10:44 PM, roni <roni.epi...@gmail.com> wrote:

> Hi,
> I am working in bioinformatics and trying to convert some scripts to
> sparkR to fit into other Spark jobs.
>
> I tried a simple example from a bioinf lib, and as soon as I start the
> sparkR environment it does not work.
>
> code as follows -
> countData <- matrix(1:100,ncol=4)
> condition <- factor(c("A","A","B","B"))
> dds <- DESeqDataSetFromMatrix(countData, DataFrame(condition), ~ condition)
>
> It works if I don't initialize the sparkR environment.
> If I do library(SparkR) and sqlContext <- sparkRSQL.init(sc), it gives the
> following error:
>
> > dds <- DESeqDataSetFromMatrix(countData, as.data.frame(condition), ~
> condition)
> Error in DataFrame(colData, row.names = rownames(colData)) :
>   cannot coerce class "data.frame" to a DataFrame
>
> I am really stumped. I am not using any Spark function, so I would expect
> it to work as simple R code.
> Why does it not work?
>
> Appreciate  the help
> -R
>
>


-- 
Alex Kozlov
(408) 507-4987
(650) 887-2135 efax
ale...@gmail.com


Re: Spark on RAID

2016-03-08 Thread Alex Kozlov
Parallel disk I/O?  But the effect should be less noticeable compared to
Hadoop, which reads/writes a lot.  Much depends on how often Spark persists
to disk, and on the specifics of the RAID controller as well.

If you write to HDFS as opposed to the local file system, this may be a big
factor as well.
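
For context, a minimal sketch (the mount points are placeholders) of pointing
Spark's scratch space at several independent disks, so that Spark itself
stripes shuffle and spill I/O across them instead of relying on a RAID
controller:

import org.apache.spark.SparkConf

// spark.local.dir accepts a comma-separated list of directories,
// ideally one per physical disk (JBOD-style separate mount points).
val conf = new SparkConf()
  .setAppName("local-dirs-sketch")
  .set("spark.local.dir", "/mnt/disk1/spark,/mnt/disk2/spark,/mnt/disk3/spark")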

On Tue, Mar 8, 2016 at 8:34 AM, Eddie Esquivel <eduardo.esqui...@gmail.com>
wrote:

> Hello All,
> In the Spark documentation under "Hardware Requirements" it very clearly
> states:
>
> We recommend having *4-8 disks* per node, configured *without* RAID (just
> as separate mount points)
>
> My question is: why not RAID? What is the argument/reason for not using
> RAID?
>
> Thanks!
> -Eddie
>

--
Alex Kozlov


Re: How to query a hive table from inside a map in Spark

2016-02-14 Thread Alex Kozlov
While this is possible via JDBC calls, it is not the best practice: you
should probably use broadcast variables
<http://spark.apache.org/docs/latest/programming-guide.html#broadcast-variables>
instead.
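
A hedged sketch of the idea (the table, column, and RDD names below are
assumptions, not from the original question): load the small User table once
on the driver, broadcast it, and do a map-side lookup instead of a per-record
Hive query.

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row

// Assumes `sc` is the SparkContext, `sqlContext` a HiveContext, and that
// `sessions` is an RDD of (userId, payload) pairs.
def enrichWithUsers(sessions: RDD[(String, String)]): RDD[(String, String, Option[Row])] = {
  // Collect the (small) User table on the driver and ship it to every executor once.
  val usersById: Map[String, Row] = sqlContext.table("users")
    .collect()
    .map(row => row.getAs[String]("userId") -> row)
    .toMap
  val usersBc = sc.broadcast(usersById)

  sessions.map { case (userId, payload) =>
    (userId, payload, usersBc.value.get(userId))  // in-memory lookup, no Hive/JDBC call per record
  }
}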

On Sun, Feb 14, 2016 at 8:40 PM, SRK <swethakasire...@gmail.com> wrote:

> Hi,
>
> Is it possible to query a Hive table which has data stored in the form of a
> Parquet file from inside map/partitions in Spark? My requirement is that I
> have a User table in Hive/HDFS, and for each record inside a sessions RDD, I
> should be able to query the User table and, if the User table already has a
> record for that userId, query the record and do further processing.
>
>
> Thanks!
>
>
>
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-query-a-hive-table-from-inside-a-map-in-Spark-tp26224.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


-- 
Alex Kozlov
(408) 507-4987
(650) 887-2135 efax
ale...@gmail.com


Re: Best practises of share Spark cluster over few applications

2016-02-14 Thread Alex Kozlov
Praveen, the mode in which you run Spark (standalone, YARN, Mesos) is
determined when you create the SparkContext
<http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.SparkContext>.
You are right that spark-submit and spark-shell create different
SparkContexts.

In general, resource sharing is the task of the cluster scheduler and there
are only 3 choices by default right now.
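
A hedged sketch of that point (the master URLs are illustrative): the deploy
mode is fixed by the master URL given when the SparkContext is built, whether
that happens in spark-submit, spark-shell, or your own web service, and on
YARN the dynamic-allocation settings let an idle application hand cores back
to the cluster.

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("embedded-web-service")
  .setMaster("yarn-client")           // or "spark://host:7077", "mesos://host:5050", "local[*]"
  .set("spark.dynamicAllocation.enabled", "true")  // requires the external shuffle service
  .set("spark.shuffle.service.enabled", "true")
val sc = new SparkContext(conf)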

On Sun, Feb 14, 2016 at 9:13 PM, praveen S <mylogi...@gmail.com> wrote:

> I was also trying to launch Spark jobs from a web service:
>
> But I thought you could run Spark jobs in YARN mode only through
> spark-submit. Is my understanding not correct?
>
> Regards,
> Praveen
> On 15 Feb 2016 08:29, "Sabarish Sasidharan" <
> sabarish.sasidha...@manthan.com> wrote:
>
>> Yes, you can look at using the capacity scheduler or the fair scheduler
>> with YARN. Both allow using the full cluster when idle, and both allow
>> considering CPU plus memory when allocating resources, which is sort of
>> necessary with Spark.
>>
>> Regards
>> Sab
>> On 13-Feb-2016 10:11 pm, "Eugene Morozov" <evgeny.a.moro...@gmail.com>
>> wrote:
>>
>>> Hi,
>>>
>>> I have several instances of the same web service that run some ML
>>> algos on Spark (both training and prediction) and also do some
>>> Spark-unrelated work. Each web-service instance creates its own
>>> JavaSparkContext, so they're seen as separate applications by Spark and
>>> are therefore configured with separate limits on resources such as cores
>>> (I'm not concerned about memory as much as about cores).
>>>
>>> With this setup, say 3 web-service instances, each of them has just 1/3
>>> of the cores. But it might happen that only one instance is going to use
>>> Spark, while the others are busy with Spark-unrelated work. In that case
>>> I'd like all Spark cores to be available for the one that's in need.
>>>
>>> Ideally I'd like the Spark cores to just be available in total, so that
>>> the first app that needs them takes as much as required from what is
>>> available at the moment. Is that possible? I believe Mesos is able to set
>>> resources free if they're not in use. Is it possible with YARN?
>>>
>>> I'd appreciate if you could share your thoughts or experience on the
>>> subject.
>>>
>>> Thanks.
>>> --
>>> Be well!
>>> Jean Morozov
>>>
>>

--
Alex Kozlov
ale...@gmail.com


Re: How can I disable logging when running local[*]?

2015-10-07 Thread Alex Kozlov
Hmm, clearly the parameter is not passed to the program.  This should be an
activator issue.  I wonder how you specify the other parameters, like
driver memory, num cores, etc.  Just out of curiosity, can you run this
program:

import org.apache.spark.SparkConf
val out = new SparkConf(true).get("spark.driver.extraJavaOptions")

in your env and see what the output is?

Also, make sure spark-defaults.conf is on your classpath.
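
One more angle, as a hedged sketch (the path is a placeholder; this is plain
sbt, which Activator wraps): sbt only applies javaOptions to a forked JVM,
which may be why the -D flag never reaches the program.

// build.sbt
fork in run := true
javaOptions in run ++= Seq(
  "-Dlog4j.configuration=file:/path/to/conf/log4j.properties"  // placeholder path
)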

On Tue, Oct 6, 2015 at 11:19 AM, Jeff Jones <jjo...@adaptivebiotech.com>
wrote:

> Here’s an example. I echoed JAVA_OPTS so that you can see what I’ve got.
> Then I call ‘activator run’ in the project directory.
>
>
> jjones-mac:analyzer-perf jjones$ echo $JAVA_OPTS
>
> -Xmx4g -Xmx4g
> -Dlog4j.configuration=file:/Users/jjones/src/adaptive/adaptiveobjects/analyzer-perf/conf/log4j.properties
>
> jjones-mac:analyzer-perf jjones$ activator run
>
> [info] Loading project definition from
> /Users/jjones/src/adaptive/adaptiveobjects/analyzer-perf/project
>
> [info] Set current project to analyzer-perf (in build
> file:/Users/jjones/src/adaptive/adaptiveobjects/analyzer-perf/)
>
> [info] Running com.adaptive.analyzer.perf.AnalyzerPerf
>
> 11:15:24.066 [run-main-0] INFO  org.apache.spark.SparkContext - Running
> Spark version 1.4.1
>
> 11:15:24.150 [run-main-0] DEBUG o.a.h.m.lib.MutableMetricsFactory - field
> org.apache.hadoop.metrics2.lib.MutableRate
> org.apache.hadoop.security.UserGroupInformation$UgiMetrics.loginSuccess
> with annotation @org.apache.hadoop.metrics2.annotation.Metric(about=,
> always=false, sampleName=Ops, type=DEFAULT, value=[Rate of successful
> kerberos logins and latency (milliseconds)], valueName=Time)
>
> 11:15:24.156 [run-main-0] DEBUG o.a.h.m.lib.MutableMetricsFactory - field
> org.apache.hadoop.metrics2.lib.MutableRate
> org.apache.hadoop.security.UserGroupInformation$UgiMetrics.loginFailure
> with annotation @org.apache.hadoop.metrics2.annotation.Metric(about=,
> always=false, sampleName=Ops, type=DEFAULT, value=[Rate of failed kerberos
> logins and latency (milliseconds)], valueName=Time)
>
> As I mentioned below but repeated for completeness, I also have this in my
> code.
>
> import org.apache.log4j.PropertyConfigurator
>
> PropertyConfigurator.configure("conf/log4j.properties")
> Logger.getRootLogger().setLevel(Level.OFF)
> Logger.getLogger("org").setLevel(Level.OFF)
> Logger.getLogger("akka").setLevel(Level.OFF)
>
> And here’s my log4j.properties (note, I’ve also tried setting the level to
> OFF):
>
> # Set everything to be logged to the console
>
> log4j.rootCategory=WARN
>
> log4j.appender.console=org.apache.log4j.ConsoleAppender
>
> log4j.appender.console.layout=org.apache.log4j.PatternLayout
>
> log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p
> %c{1}: %m%n
>
>
> # Change this to set Spark log level
>
> log4j.logger.org.apache.spark=WARN
>
>
> # Silence akka remoting
>
> log4j.logger.Remoting=WARN
>
>
> # Ignore messages below warning level from Jetty, because it's a bit
> verbose
>
> log4j.logger.org.eclipse.jetty=WARN
>
>
> spark.log.threshold=OFF
>
> spark.root.logger=OFF,DRFA
>
>
> From: Alex Kozlov
> Date: Tuesday, October 6, 2015 at 10:50 AM
>
> To: Jeff Jones
> Cc: "user@spark.apache.org"
> Subject: Re: How can I disable logging when running local[*]?
>
> Try
>
> JAVA_OPTS='-Dlog4j.configuration=file:/'
>
> Internally, this is just spark.driver.extraJavaOptions, which you should
> be able to set in conf/spark-defaults.conf
>
> Can you provide more details on how you invoke the driver?
>
> On Tue, Oct 6, 2015 at 9:48 AM, Jeff Jones <jjo...@adaptivebiotech.com>
> wrote:
>
>> Thanks. Any chance you know how to pass this to a Scala app that is run
>> via TypeSafe activator?
>>
>> I tried putting it $JAVA_OPTS but I get:
>>
>> Unrecognized option: --driver-java-options
>>
>> Error: Could not create the Java Virtual Machine.
>>
>> Error: A fatal exception has occurred. Program will exit.
>>
>>
>> I tried a bunch of different quoting but nothing produced a good result.
>> I also tried passing it directly to activator using –jvm but it still
>> produces the same results with verbose logging. Is there a way I can tell
>> if it’s picking up my file?
>>
>>
>>
>> From: Alex Kozlov
>> Date: Monday, October 5, 2015 at 8:34 PM
>> To: Jeff Jones
>> Cc: "user@spark.apache.org"
>> Subject: Re: How can I disable logging when running local[*]?
>>
>> Did you try “--driver-java-options
>> '-Dlog4j.configuration=file:/'” 

Re: How can I disable logging when running local[*]?

2015-10-06 Thread Alex Kozlov
Try

JAVA_OPTS='-Dlog4j.configuration=file:/'

Internally, this is just spark.driver.extraJavaOptions, which you should be
able to set in conf/spark-defaults.conf

Can you provide more details on how you invoke the driver?

On Tue, Oct 6, 2015 at 9:48 AM, Jeff Jones <jjo...@adaptivebiotech.com>
wrote:

> Thanks. Any chance you know how to pass this to a Scala app that is run
> via TypeSafe activator?
>
> I tried putting it $JAVA_OPTS but I get:
>
> Unrecognized option: --driver-java-options
>
> Error: Could not create the Java Virtual Machine.
>
> Error: A fatal exception has occurred. Program will exit.
>
>
> I tried a bunch of different quoting but nothing produced a good result. I
> also tried passing it directly to activator using –jvm but it still
> produces the same results with verbose logging. Is there a way I can tell
> if it’s picking up my file?
>
>
>
> From: Alex Kozlov
> Date: Monday, October 5, 2015 at 8:34 PM
> To: Jeff Jones
> Cc: "user@spark.apache.org"
> Subject: Re: How can I disable logging when running local[*]?
>
> Did you try “--driver-java-options
> '-Dlog4j.configuration=file:/'” and setting the
> log4j.rootLogger=FATAL,console?
>
> On Mon, Oct 5, 2015 at 8:19 PM, Jeff Jones <jjo...@adaptivebiotech.com>
> wrote:
>
>> I’ve written an application that hosts the Spark driver in-process using
>> “local[*]”. I’ve turned off logging in my conf/log4j.properties file. I’ve
>> also tried putting the following code prior to creating my SparkContext.
>> These were coupled together from various posts I’ve found. None of these steps
>> have worked. I’m still getting a ton of logging to the console. Anything
>> else I can try?
>>
>> Thanks,
>> Jeff
>>
>> private def disableLogging(): Unit = {
>>   import org.apache.log4j.PropertyConfigurator
>>
>>   PropertyConfigurator.configure("conf/log4j.properties")
>>   Logger.getRootLogger().setLevel(Level.OFF)
>>   Logger.getLogger("org").setLevel(Level.OFF)
>>   Logger.getLogger("akka").setLevel(Level.OFF)
>> }
>>
>>
>>
>> This message (and any attachments) is intended only for the designated
>> recipient(s). It
>> may contain confidential or proprietary information, or have other
>> limitations on use as
>> indicated by the sender. If you are not a designated recipient, you may
>> not review, use,
>> copy or distribute this message. If you received this in error, please
>> notify the sender by
>> reply e-mail and delete this message.
>>
>
>
>
> --
> Alex Kozlov
> (408) 507-4987
> (408) 830-9982 fax
> (650) 887-2135 efax
> ale...@gmail.com
>
>
> This message (and any attachments) is intended only for the designated
> recipient(s). It
> may contain confidential or proprietary information, or have other
> limitations on use as
> indicated by the sender. If you are not a designated recipient, you may
> not review, use,
> copy or distribute this message. If you received this in error, please
> notify the sender by
> reply e-mail and delete this message.
>


Re: How can I disable logging when running local[*]?

2015-10-05 Thread Alex Kozlov
Did you try “--driver-java-options
'-Dlog4j.configuration=file:/'” and setting the
log4j.rootLogger=FATAL,console?

On Mon, Oct 5, 2015 at 8:19 PM, Jeff Jones <jjo...@adaptivebiotech.com>
wrote:

> I’ve written an application that hosts the Spark driver in-process using
> “local[*]”. I’ve turned off logging in my conf/log4j.properties file. I’ve
> also tried putting the following code prior to creating my SparkContext.
> These were coupled together from various posts I’ve found. None of these steps
> have worked. I’m still getting a ton of logging to the console. Anything
> else I can try?
>
> Thanks,
> Jeff
>
> private def disableLogging(): Unit = {
>   import org.apache.log4j.PropertyConfigurator
>
>   PropertyConfigurator.configure("conf/log4j.properties")
>   Logger.getRootLogger().setLevel(Level.OFF)
>   Logger.getLogger("org").setLevel(Level.OFF)
>   Logger.getLogger("akka").setLevel(Level.OFF)
> }
>
>
>
> This message (and any attachments) is intended only for the designated
> recipient(s). It
> may contain confidential or proprietary information, or have other
> limitations on use as
> indicated by the sender. If you are not a designated recipient, you may
> not review, use,
> copy or distribute this message. If you received this in error, please
> notify the sender by
> reply e-mail and delete this message.
>



-- 
Alex Kozlov
(408) 507-4987
(408) 830-9982 fax
(650) 887-2135 efax
ale...@gmail.com


Re: Parquet Array Support Broken?

2015-09-07 Thread Alex Kozlov
Thank you - it works if the file is created in Spark
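
For reference, a hedged sketch of workaround 1 (the column names and output
path are made up): generate the array-typed Parquet file from Spark itself,
which Spark 1.4 then reads back without trouble.

val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
import sqlContext.implicits._

// Array-typed column written by Spark rather than by Hive's CTAS.
val df = Seq(
  ("a", Seq(1, 2, 3)),
  ("b", Seq(4, 5))
).toDF("id", "values")

df.write.parquet("/tmp/stats_spark_parquet")   // placeholder path
sqlContext.read.parquet("/tmp/stats_spark_parquet").printSchema()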

On Mon, Sep 7, 2015 at 3:06 PM, Ruslan Dautkhanov <dautkha...@gmail.com>
wrote:

> Read the response from Cheng Lian <lian.cs@gmail.com> on Aug/27th - it
> looks like the same problem.
>
> Workarounds:
> 1. write that parquet file in Spark;
> 2. upgrade to Spark 1.5.
>
> --
> Ruslan Dautkhanov
>
> On Mon, Sep 7, 2015 at 3:52 PM, Alex Kozlov <ale...@gmail.com> wrote:
>
>> No, it was created in Hive by CTAS, but any help is appreciated...
>>
>> On Mon, Sep 7, 2015 at 2:51 PM, Ruslan Dautkhanov <dautkha...@gmail.com>
>> wrote:
>>
>>> That parquet table wasn't created in Spark, was it?
>>>
>>> There was a recent discussion on this list that complex data types in
>>> Spark prior to 1.5 are often incompatible with Hive, for example, if I
>>> remember correctly.
>>> On Mon, Sep 7, 2015, 2:57 PM Alex Kozlov <ale...@gmail.com> wrote:
>>>
>>>> I am trying to read an (array typed) parquet file in spark-shell (Spark
>>>> 1.4.1 with Hadoop 2.6):
>>>>
>>>> {code}
>>>> $ bin/spark-shell
>>>> log4j:WARN No appenders could be found for logger
>>>> (org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
>>>> log4j:WARN Please initialize the log4j system properly.
>>>> log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig
>>>> for more info.
>>>> Using Spark's default log4j profile:
>>>> org/apache/spark/log4j-defaults.properties
>>>> 15/09/07 13:45:22 INFO SecurityManager: Changing view acls to: hivedata
>>>> 15/09/07 13:45:22 INFO SecurityManager: Changing modify acls to:
>>>> hivedata
>>>> 15/09/07 13:45:22 INFO SecurityManager: SecurityManager: authentication
>>>> disabled; ui acls disabled; users with view permissions: Set(hivedata);
>>>> users with modify permissions: Set(hivedata)
>>>> 15/09/07 13:45:23 INFO HttpServer: Starting HTTP Server
>>>> 15/09/07 13:45:23 INFO Utils: Successfully started service 'HTTP class
>>>> server' on port 43731.
>>>> Welcome to
>>>>       ____              __
>>>>      / __/__  ___ _____/ /__
>>>>     _\ \/ _ \/ _ `/ __/  '_/
>>>>    /___/ .__/\_,_/_/ /_/\_\   version 1.4.1
>>>>       /_/
>>>>
>>>> Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java
>>>> 1.8.0)
>>>> Type in expressions to have them evaluated.
>>>> Type :help for more information.
>>>> 15/09/07 13:45:26 INFO SparkContext: Running Spark version 1.4.1
>>>> 15/09/07 13:45:26 INFO SecurityManager: Changing view acls to: hivedata
>>>> 15/09/07 13:45:26 INFO SecurityManager: Changing modify acls to:
>>>> hivedata
>>>> 15/09/07 13:45:26 INFO SecurityManager: SecurityManager: authentication
>>>> disabled; ui acls disabled; users with view permissions: Set(hivedata);
>>>> users with modify permissions: Set(hivedata)
>>>> 15/09/07 13:45:27 INFO Slf4jLogger: Slf4jLogger started
>>>> 15/09/07 13:45:27 INFO Remoting: Starting remoting
>>>> 15/09/07 13:45:27 INFO Remoting: Remoting started; listening on
>>>> addresses :[akka.tcp://sparkDriver@10.10.30.52:46083]
>>>> 15/09/07 13:45:27 INFO Utils: Successfully started service
>>>> 'sparkDriver' on port 46083.
>>>> 15/09/07 13:45:27 INFO SparkEnv: Registering MapOutputTracker
>>>> 15/09/07 13:45:27 INFO SparkEnv: Registering BlockManagerMaster
>>>> 15/09/07 13:45:27 INFO DiskBlockManager: Created local directory at
>>>> /tmp/spark-f313315a-0769-4057-835d-196cfe140a26/blockmgr-bd1b8498-9f6a-47c4-ae59-8800563f97d0
>>>> 15/09/07 13:45:27 INFO MemoryStore: MemoryStore started with capacity
>>>> 265.1 MB
>>>> 15/09/07 13:45:27 INFO HttpFileServer: HTTP File server directory is
>>>> /tmp/spark-f313315a-0769-4057-835d-196cfe140a26/httpd-3fbe0c9d-c0c5-41ef-bf72-4f0ef59bfa21
>>>> 15/09/07 13:45:27 INFO HttpServer: Starting HTTP Server
>>>> 15/09/07 13:45:27 INFO Utils: Successfully started service 'HTTP file
>>>> server' on port 38717.
>>>> 15/09/07 13:45:27 INFO SparkEnv: Registering OutputCommitCoordinator
>>>> 15/09/07 13:45:27 WARN Utils: Service 'SparkUI' could not bind on port
>>>> 4040. Attempting port 4041.
>>>> 15/09/07 13:45:27 INFO Utils: Successfully started service 'SparkUI' on
>>>> port 4041.
>>>> 15/09/07 13:45:27 INFO Spar

Re: Parquet Array Support Broken?

2015-09-07 Thread Alex Kozlov
No, it was created in Hive by CTAS, but any help is appreciated...

On Mon, Sep 7, 2015 at 2:51 PM, Ruslan Dautkhanov <dautkha...@gmail.com>
wrote:

> That parquet table wasn't created in Spark, was it?
>
> There was a recent discussion on this list that complex data types in
> Spark prior to 1.5 are often incompatible with Hive, for example, if I
> remember correctly.
> On Mon, Sep 7, 2015, 2:57 PM Alex Kozlov <ale...@gmail.com> wrote:
>
>> I am trying to read an (array typed) parquet file in spark-shell (Spark
>> 1.4.1 with Hadoop 2.6):
>>
>> {code}
>> $ bin/spark-shell
>> log4j:WARN No appenders could be found for logger
>> (org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
>> log4j:WARN Please initialize the log4j system properly.
>> log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for
>> more info.
>> Using Spark's default log4j profile:
>> org/apache/spark/log4j-defaults.properties
>> 15/09/07 13:45:22 INFO SecurityManager: Changing view acls to: hivedata
>> 15/09/07 13:45:22 INFO SecurityManager: Changing modify acls to: hivedata
>> 15/09/07 13:45:22 INFO SecurityManager: SecurityManager: authentication
>> disabled; ui acls disabled; users with view permissions: Set(hivedata);
>> users with modify permissions: Set(hivedata)
>> 15/09/07 13:45:23 INFO HttpServer: Starting HTTP Server
>> 15/09/07 13:45:23 INFO Utils: Successfully started service 'HTTP class
>> server' on port 43731.
>> Welcome to
>>       ____              __
>>      / __/__  ___ _____/ /__
>>     _\ \/ _ \/ _ `/ __/  '_/
>>    /___/ .__/\_,_/_/ /_/\_\   version 1.4.1
>>       /_/
>>
>> Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0)
>> Type in expressions to have them evaluated.
>> Type :help for more information.
>> 15/09/07 13:45:26 INFO SparkContext: Running Spark version 1.4.1
>> 15/09/07 13:45:26 INFO SecurityManager: Changing view acls to: hivedata
>> 15/09/07 13:45:26 INFO SecurityManager: Changing modify acls to: hivedata
>> 15/09/07 13:45:26 INFO SecurityManager: SecurityManager: authentication
>> disabled; ui acls disabled; users with view permissions: Set(hivedata);
>> users with modify permissions: Set(hivedata)
>> 15/09/07 13:45:27 INFO Slf4jLogger: Slf4jLogger started
>> 15/09/07 13:45:27 INFO Remoting: Starting remoting
>> 15/09/07 13:45:27 INFO Remoting: Remoting started; listening on addresses
>> :[akka.tcp://sparkDriver@10.10.30.52:46083]
>> 15/09/07 13:45:27 INFO Utils: Successfully started service 'sparkDriver'
>> on port 46083.
>> 15/09/07 13:45:27 INFO SparkEnv: Registering MapOutputTracker
>> 15/09/07 13:45:27 INFO SparkEnv: Registering BlockManagerMaster
>> 15/09/07 13:45:27 INFO DiskBlockManager: Created local directory at
>> /tmp/spark-f313315a-0769-4057-835d-196cfe140a26/blockmgr-bd1b8498-9f6a-47c4-ae59-8800563f97d0
>> 15/09/07 13:45:27 INFO MemoryStore: MemoryStore started with capacity
>> 265.1 MB
>> 15/09/07 13:45:27 INFO HttpFileServer: HTTP File server directory is
>> /tmp/spark-f313315a-0769-4057-835d-196cfe140a26/httpd-3fbe0c9d-c0c5-41ef-bf72-4f0ef59bfa21
>> 15/09/07 13:45:27 INFO HttpServer: Starting HTTP Server
>> 15/09/07 13:45:27 INFO Utils: Successfully started service 'HTTP file
>> server' on port 38717.
>> 15/09/07 13:45:27 INFO SparkEnv: Registering OutputCommitCoordinator
>> 15/09/07 13:45:27 WARN Utils: Service 'SparkUI' could not bind on port
>> 4040. Attempting port 4041.
>> 15/09/07 13:45:27 INFO Utils: Successfully started service 'SparkUI' on
>> port 4041.
>> 15/09/07 13:45:27 INFO SparkUI: Started SparkUI at
>> http://10.10.30.52:4041
>> 15/09/07 13:45:27 INFO Executor: Starting executor ID driver on host
>> localhost
>> 15/09/07 13:45:27 INFO Executor: Using REPL class URI:
>> http://10.10.30.52:43731
>> 15/09/07 13:45:27 INFO Utils: Successfully started service
>> 'org.apache.spark.network.netty.NettyBlockTransferService' on port 60973.
>> 15/09/07 13:45:27 INFO NettyBlockTransferService: Server created on 60973
>> 15/09/07 13:45:27 INFO BlockManagerMaster: Trying to register BlockManager
>> 15/09/07 13:45:27 INFO BlockManagerMasterEndpoint: Registering block
>> manager localhost:60973 with 265.1 MB RAM, BlockManagerId(driver,
>> localhost, 60973)
>> 15/09/07 13:45:27 INFO BlockManagerMaster: Registered BlockManager
>> 15/09/07 13:45:28 INFO SparkILoop: Created spark context..
>> Spark context available as sc.
>> 15/09/07 13:45:28 INFO HiveContext: Initializing execution hive, version
>> 0.13.1
>> 15/09/07 13:45:28 INFO HiveMetaStore: 0: Opening raw st

Re: Parquet Array Support Broken?

2015-09-07 Thread Alex Kozlov
I get the same error if I do:

val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
val results = sqlContext.sql("SELECT * FROM stats")

but it does work from the Hive shell directly...

On Mon, Sep 7, 2015 at 1:56 PM, Alex Kozlov <ale...@gmail.com> wrote:

> I am trying to read an (array typed) parquet file in spark-shell (Spark
> 1.4.1 with Hadoop 2.6):
>
> {code}
> $ bin/spark-shell
> log4j:WARN No appenders could be found for logger
> (org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
> log4j:WARN Please initialize the log4j system properly.
> log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for
> more info.
> Using Spark's default log4j profile:
> org/apache/spark/log4j-defaults.properties
> 15/09/07 13:45:22 INFO SecurityManager: Changing view acls to: hivedata
> 15/09/07 13:45:22 INFO SecurityManager: Changing modify acls to: hivedata
> 15/09/07 13:45:22 INFO SecurityManager: SecurityManager: authentication
> disabled; ui acls disabled; users with view permissions: Set(hivedata);
> users with modify permissions: Set(hivedata)
> 15/09/07 13:45:23 INFO HttpServer: Starting HTTP Server
> 15/09/07 13:45:23 INFO Utils: Successfully started service 'HTTP class
> server' on port 43731.
> Welcome to
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/  '_/
>    /___/ .__/\_,_/_/ /_/\_\   version 1.4.1
>       /_/
>
> Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0)
> Type in expressions to have them evaluated.
> Type :help for more information.
> 15/09/07 13:45:26 INFO SparkContext: Running Spark version 1.4.1
> 15/09/07 13:45:26 INFO SecurityManager: Changing view acls to: hivedata
> 15/09/07 13:45:26 INFO SecurityManager: Changing modify acls to: hivedata
> 15/09/07 13:45:26 INFO SecurityManager: SecurityManager: authentication
> disabled; ui acls disabled; users with view permissions: Set(hivedata);
> users with modify permissions: Set(hivedata)
> 15/09/07 13:45:27 INFO Slf4jLogger: Slf4jLogger started
> 15/09/07 13:45:27 INFO Remoting: Starting remoting
> 15/09/07 13:45:27 INFO Remoting: Remoting started; listening on addresses
> :[akka.tcp://sparkDriver@10.10.30.52:46083]
> 15/09/07 13:45:27 INFO Utils: Successfully started service 'sparkDriver'
> on port 46083.
> 15/09/07 13:45:27 INFO SparkEnv: Registering MapOutputTracker
> 15/09/07 13:45:27 INFO SparkEnv: Registering BlockManagerMaster
> 15/09/07 13:45:27 INFO DiskBlockManager: Created local directory at
> /tmp/spark-f313315a-0769-4057-835d-196cfe140a26/blockmgr-bd1b8498-9f6a-47c4-ae59-8800563f97d0
> 15/09/07 13:45:27 INFO MemoryStore: MemoryStore started with capacity
> 265.1 MB
> 15/09/07 13:45:27 INFO HttpFileServer: HTTP File server directory is
> /tmp/spark-f313315a-0769-4057-835d-196cfe140a26/httpd-3fbe0c9d-c0c5-41ef-bf72-4f0ef59bfa21
> 15/09/07 13:45:27 INFO HttpServer: Starting HTTP Server
> 15/09/07 13:45:27 INFO Utils: Successfully started service 'HTTP file
> server' on port 38717.
> 15/09/07 13:45:27 INFO SparkEnv: Registering OutputCommitCoordinator
> 15/09/07 13:45:27 WARN Utils: Service 'SparkUI' could not bind on port
> 4040. Attempting port 4041.
> 15/09/07 13:45:27 INFO Utils: Successfully started service 'SparkUI' on
> port 4041.
> 15/09/07 13:45:27 INFO SparkUI: Started SparkUI at http://10.10.30.52:4041
> 15/09/07 13:45:27 INFO Executor: Starting executor ID driver on host
> localhost
> 15/09/07 13:45:27 INFO Executor: Using REPL class URI:
> http://10.10.30.52:43731
> 15/09/07 13:45:27 INFO Utils: Successfully started service
> 'org.apache.spark.network.netty.NettyBlockTransferService' on port 60973.
> 15/09/07 13:45:27 INFO NettyBlockTransferService: Server created on 60973
> 15/09/07 13:45:27 INFO BlockManagerMaster: Trying to register BlockManager
> 15/09/07 13:45:27 INFO BlockManagerMasterEndpoint: Registering block
> manager localhost:60973 with 265.1 MB RAM, BlockManagerId(driver,
> localhost, 60973)
> 15/09/07 13:45:27 INFO BlockManagerMaster: Registered BlockManager
> 15/09/07 13:45:28 INFO SparkILoop: Created spark context..
> Spark context available as sc.
> 15/09/07 13:45:28 INFO HiveContext: Initializing execution hive, version
> 0.13.1
> 15/09/07 13:45:28 INFO HiveMetaStore: 0: Opening raw store with
> implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
> 15/09/07 13:45:29 INFO ObjectStore: ObjectStore, initialize called
> 15/09/07 13:45:29 INFO Persistence: Property
> hive.metastore.integral.jdo.pushdown unknown - will be ignored
> 15/09/07 13:45:29 INFO Persistence: Property datanucleus.cache.level2
> unknown - will be ignored
> 15/09/07 13:45:29 WARN Connection: BoneCP specified but not present in
> CLASSPATH (or one of dependencies)
> 15/09/07 13:

Parquet Array Support Broken?

2015-09-07 Thread Alex Kozlov
I am trying to read an (array typed) parquet file in spark-shell (Spark
1.4.1 with Hadoop 2.6):

{code}
$ bin/spark-shell
log4j:WARN No appenders could be found for logger
(org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for
more info.
Using Spark's default log4j profile:
org/apache/spark/log4j-defaults.properties
15/09/07 13:45:22 INFO SecurityManager: Changing view acls to: hivedata
15/09/07 13:45:22 INFO SecurityManager: Changing modify acls to: hivedata
15/09/07 13:45:22 INFO SecurityManager: SecurityManager: authentication
disabled; ui acls disabled; users with view permissions: Set(hivedata);
users with modify permissions: Set(hivedata)
15/09/07 13:45:23 INFO HttpServer: Starting HTTP Server
15/09/07 13:45:23 INFO Utils: Successfully started service 'HTTP class
server' on port 43731.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.4.1
      /_/

Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0)
Type in expressions to have them evaluated.
Type :help for more information.
15/09/07 13:45:26 INFO SparkContext: Running Spark version 1.4.1
15/09/07 13:45:26 INFO SecurityManager: Changing view acls to: hivedata
15/09/07 13:45:26 INFO SecurityManager: Changing modify acls to: hivedata
15/09/07 13:45:26 INFO SecurityManager: SecurityManager: authentication
disabled; ui acls disabled; users with view permissions: Set(hivedata);
users with modify permissions: Set(hivedata)
15/09/07 13:45:27 INFO Slf4jLogger: Slf4jLogger started
15/09/07 13:45:27 INFO Remoting: Starting remoting
15/09/07 13:45:27 INFO Remoting: Remoting started; listening on addresses
:[akka.tcp://sparkDriver@10.10.30.52:46083]
15/09/07 13:45:27 INFO Utils: Successfully started service 'sparkDriver' on
port 46083.
15/09/07 13:45:27 INFO SparkEnv: Registering MapOutputTracker
15/09/07 13:45:27 INFO SparkEnv: Registering BlockManagerMaster
15/09/07 13:45:27 INFO DiskBlockManager: Created local directory at
/tmp/spark-f313315a-0769-4057-835d-196cfe140a26/blockmgr-bd1b8498-9f6a-47c4-ae59-8800563f97d0
15/09/07 13:45:27 INFO MemoryStore: MemoryStore started with capacity 265.1
MB
15/09/07 13:45:27 INFO HttpFileServer: HTTP File server directory is
/tmp/spark-f313315a-0769-4057-835d-196cfe140a26/httpd-3fbe0c9d-c0c5-41ef-bf72-4f0ef59bfa21
15/09/07 13:45:27 INFO HttpServer: Starting HTTP Server
15/09/07 13:45:27 INFO Utils: Successfully started service 'HTTP file
server' on port 38717.
15/09/07 13:45:27 INFO SparkEnv: Registering OutputCommitCoordinator
15/09/07 13:45:27 WARN Utils: Service 'SparkUI' could not bind on port
4040. Attempting port 4041.
15/09/07 13:45:27 INFO Utils: Successfully started service 'SparkUI' on
port 4041.
15/09/07 13:45:27 INFO SparkUI: Started SparkUI at http://10.10.30.52:4041
15/09/07 13:45:27 INFO Executor: Starting executor ID driver on host
localhost
15/09/07 13:45:27 INFO Executor: Using REPL class URI:
http://10.10.30.52:43731
15/09/07 13:45:27 INFO Utils: Successfully started service
'org.apache.spark.network.netty.NettyBlockTransferService' on port 60973.
15/09/07 13:45:27 INFO NettyBlockTransferService: Server created on 60973
15/09/07 13:45:27 INFO BlockManagerMaster: Trying to register BlockManager
15/09/07 13:45:27 INFO BlockManagerMasterEndpoint: Registering block
manager localhost:60973 with 265.1 MB RAM, BlockManagerId(driver,
localhost, 60973)
15/09/07 13:45:27 INFO BlockManagerMaster: Registered BlockManager
15/09/07 13:45:28 INFO SparkILoop: Created spark context..
Spark context available as sc.
15/09/07 13:45:28 INFO HiveContext: Initializing execution hive, version
0.13.1
15/09/07 13:45:28 INFO HiveMetaStore: 0: Opening raw store with
implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
15/09/07 13:45:29 INFO ObjectStore: ObjectStore, initialize called
15/09/07 13:45:29 INFO Persistence: Property
hive.metastore.integral.jdo.pushdown unknown - will be ignored
15/09/07 13:45:29 INFO Persistence: Property datanucleus.cache.level2
unknown - will be ignored
15/09/07 13:45:29 WARN Connection: BoneCP specified but not present in
CLASSPATH (or one of dependencies)
15/09/07 13:45:29 WARN Connection: BoneCP specified but not present in
CLASSPATH (or one of dependencies)
15/09/07 13:45:36 INFO ObjectStore: Setting MetaStore object pin classes
with
hive.metastore.cache.pinobjtypes="Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order"
15/09/07 13:45:36 INFO MetaStoreDirectSql: MySQL check failed, assuming we
are not on mysql: Lexical error at line 1, column 5.  Encountered: "@"
(64), after : "".
15/09/07 13:45:37 INFO Datastore: The class
"org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as
"embedded-only" so does not have its own datastore table.
15/09/07 13:45:37 INFO Datastore: The class