[ https://issues.apache.org/jira/browse/CASSANDRA-9304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14945873#comment-14945873 ]
Tyler Hobbs commented on CASSANDRA-9304: ---------------------------------------- Overall the patch is definitely a nice improvement! Most of my review comments are about using more idiomatic python. Imports: * In the imports, StringIO is imported later. Also, to avoid flake8 warnings, add {{# noqa}} to the {{from io import StringIO}} line. ImportExportConfig: * The way that checks for unhandled options are performed is kind of strange. In perform_csv_import() they're checked immediately, but in perform_csv_export() it's deferred to ExportTask. Additionally, a lot of functions seem to modify the options inside an ImportExportConfig instance, which is somewhat unexpected behavior and seems likely to result in bugs in the future. * In Python it's pretty unusual to use a class to only store mutable data, so I would replace the ImportExportConfig class with a function that returns a tuple (e.g. {{(csv_options, dialect_options, unrecognized_options)}}) or a dict. perform_csv_import(): * Why do {{csv_options = config.csv_options}} if it's only used once a couple of lines later? * As a matter of code style, avoid using tuple unpacking unless you're actually unpacking a tuple or the assignments are very trivial (e.g. {{a, b = None, None}}) ExportTask: * The class should inherit from object * get_ranges(): ** There's no need to use a lambda sorting key on a list of tuples. Tuples are naturally sorted sanely. ** Instead of {{del hosts\[:\]}}, just say {{hosts = \[\]}}. You can also remove {{hosts = \[hostname\]}} above. ** Instead of doing {{for entry in ring}} and then using {{entry\[0\]}} and {{entry\[1\]}}, you can unpack the tuple in the loop: {{for token, replicas in ring:}} ** The final {{ranges.append()}} call could use a comment for explanation. If you make the suggested changes to {{hosts}}, just use {{ranges\[-1\]\[1\]}} for the hosts (unless it's empty). This approach is a little more obvious for readers. * In {{send_initial_work()}}, it's idiomatic in python to use {{_}} for the name of an unused variable, so the loop should be: {{for _ in xrange(num_parallel_jobs)}}. ExportProcess: * In {{start_job()}}, it's not clear what would result in an IndexError and why we're ignoring it. Try to limit try/excepts to the minimum number of lines they need to contain, and add a comment explaining what's going on there. * ExportSession should be a separate top-level class instead of a nested one. (It's pretty unusual to nest classes.) Also, it should inherit from object(). * I would rename {{handle_result}} to {{attach_callbacks}} and rename {{return_result}} to {{write_rows_to_csv}}. * The host selection code in {{get_session}} is over-complicated. How about something like: {code} new_hosts = [h for h in hosts if h not in self.hosts_to_sessions] if new_hosts: host = new_hosts[0] # open Cluster, etc else: host = min(hosts, key=lambda h: self.hosts_to_sessions[h].jobs) session = self.hosts_to_sessions[host] session.jobs += 1 return session {code} General: * Try to add a few more code comments explaining what's going on at a high level. Documenting method return values is also good (since there's no type information about that). * You may want to put the csv-related classes and functions into separate modules under {{pylib}}. > COPY TO improvements > -------------------- > > Key: CASSANDRA-9304 > URL: https://issues.apache.org/jira/browse/CASSANDRA-9304 > Project: Cassandra > Issue Type: Improvement > Components: Core > Reporter: Jonathan Ellis > Assignee: Stefania > Priority: Minor > Labels: cqlsh > Fix For: 2.1.x > > > COPY FROM has gotten a lot of love. COPY TO not so much. One obvious > improvement could be to parallelize reading and writing (write one page of > data while fetching the next). -- This message was sent by Atlassian JIRA (v6.3.4#6332)